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(54) Computer network interface and interface protocot 

(57) A network protocol and interface using direct deposit mess aging provides tow overhead communication 
in a network of multi-user computers. This system uses both sender-provided and receiver-provided 
information to process received messages and to deposit both data and control information directly where 
they are needed: data in memory and control information in conditionally/optionally interrupting a host 
processor. Message processing is separated into data delivery, which bypasses the host processor and 
operating system, and message actions which may or may not require host processor interaction. In this 
protocol, a message includes an indication of the operation desired by the sender, an operand specified by the 
sender and an operand which refers to some Information stored at the receiver. The receiver ensures that the 
desired action is permitted and then, if the action is permitted, performs the action according to both the 
operand specified by the sender and the state of the receiver. The action may be message delivery, wherein 
the operands in the message specify values for use in various addressing modes including direct, indirect 
post-increment and index modes. The action may also be conditionally generating an interrupt, wherein the 
operands are used, in combination with the receiver state, to determine whether a message requires 
immediate or delayed action. The action may also be an operation on a register in the network interface or on 
other information stored at the receiver. The network interface and protocol are intended for use with 
local-area networks. Specializations of this interface and protocol are particularly applicable to asynchronous 
transfer mode (ATM) networks. The network interface includes endpoints which may be nested and 
overlapped, address registers which may be organized into windows which may be nested and overlapped, 
address register protection and integration of exception handling and flow control. 
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2301264 

COMPUTER NETWORK INTERFACE AND INTERFACE PROTOCOL 
This invention relates to computer network interfaces and 

protocols and more particularly to such interfaces and 

protocols for low overhead communication. The invention is 

particularly applicable to asynchronous transfer mode (ATM) 

networks and local-area networks (LANS). 



A communication system is a significant part of any 
modern computer system. " A fundamental characteristic of any 
such communication system is the communication overhead. 
Such overhead determines the kinds of applications that can 
be exploited efficiently. Low overhead communication (low 
latency and low impact on a host as defined below) is 
particularly important in parallel, distributed, or real-time 

computing systems. 

In general, two fundamental properties in communication 
systems contribute to overhead. The first is data delivery 
and the second is message action. Specifically. data 
delivery requires addressing at the receiver and message 
action requires interrupts to invoke message action at the 
receiver. Thus, communication involves both data delivery, 
transferring data from a sender to a receiver, and message 
action, invoking some special action. such as 
synchronization, on arrival of data at the receiver. Message 
action is often requested by interrupting the processor at the 
receiver . 

Com.munication systems can be classified according to the 
division of burden between sender and receiver. In a 
receiver-based system, information controlling data delivery 
and message action is localized to the receiver. In 
contrast, in a sender-based system, the sender plays a more 
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a^.ect role W .peci£yin9 in£cr.«i.n in e.ch .e=s»9e « 
control data delivery and message .otion. The key 
,„,or™ation for data delivery is the destination address of 
data. In systems .sing receiver-based addressing, the 
,„.,ce ^as no direct input on the linal address oi a message: 
3 message identifies a .u»er at th. receiver into which the 
message . -is-stored .t^o.e_i.pl_icit location, e.g. by 
sTguencing a pointer. By contrkst. in..,.syste.s us>n, 
,erder-based addressing, the source specifies an address, 
contained in each message, indicating directly where the 
„.ssage should he stored at the receiver. Keceiver-hased 
.aaressin, involves significant overhead in comparison to 

Hnu,pver sender-based addressing 
sender-based addressing. However, senoe 

raises protection issues. 

The key infoIiiTation for message action, is whether to 
5..erate an interrupt on message arrival. In systems using 
,eceiver-based interrupts, an interrupt is generatea by the 
receiver on the arrival of every message. In systems using 
,encer-based interrupts, the sender specifies, by information 
contained in each message, whether or not an interrupt should 
be generated on the arrival of that message. 

„ost conventional co»;-.unication systems are 

=^ the typical ethernet network using 
receiver-based, such as the lypi^ 

Kn« ot>cn public local area networks 
the internet protocol. Host open, puu 

use this kind of protocol as well. 

several systems use purely sender-based addressing. For 
example, there is a system called "Hamlyn." of which an 
,„P>eme„tation in hardware is discussed in "Hamlyn: » 
interface for Sender-Based co^unication. " by aohn »ilkes m 
^^^^^.^ „P.-0SK-,.-13. Hewlett-Packard Laboratories, 
^ovem^er 1,,.. One difficulty with the published work on 
Hamlyh is that it merely provides a high level design 
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overview, with few implementation specifics. This system 
includes sufficient protection mechanisms for a multi-user 
LAN environment, but the published work does not specialize 
it to any network, only an unspecified "private multicomputer 
interconnect," similar to others for parallel machines. 

Another system similar to Hamlyn is described in 
"Efficient Support for Multicomputing on ATM Networks" by C 
Thekkath et " a 1 . . Technical R pt^ort TR93-04-03. Dept. of 
computer Science and Engineering, Univ. of Washington. 
Seattle, Washington, April 12. 1993. This system as a 
software-based emulation of purely sender-based addressing 
specialized to ATM LANs, designed for distributed system 
applications. Hardware support for sender-based addressing 
is not addressed by Thekkath et al . 
^- One problem with systems__whi_ch—use_purely_ sender-based 
addressing is that in many cases the location at the receiver 
where data is to be p-lac-ed -should -be- dependent on the state 
of the receiver. For example, if incoming messages are to be 
queued at the receiver, the location of the end of the queue 
is dependent on the receiver state. To handle queueing in a 
purely sender-based addressing system, the sender must know 
the state of the receiver. Therefore, the sender must either 
keep track of this location, which becomes difficult when 
there is more than one sender sending to the same receiver, 
or use additional messages to determine the state of the 
receiver, which introduces atomicity issues. Either of these 
options increases both latency and impact on the host 
processors of the sender and receiver. Also, such extensive 
knowledge of the receiver by the sender raises protection 
problems. Similarly. sender-based interrupts also have 
problems. The sender based nature limits the possible 
interrupt operations. For example, if the interrupt status 
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is only a function of the state maintained by the sender, it 
beconies difficult to priority schedule interrupts at the 
receiver. 

To overcome some of the communication overhead problems 
with either purely sender-based = or purely receiver-based 
coiroiunication, many new parallel machines use variations of 
both sender-based and receiver-based addressing. For 
"example. the~Meiko CS-2. of Waltham'. Massachusetts, supports 
both traditional send/receive communication using 
receiver-based addressing and a remote read/write model using 
sender-based addressing. To support bulk data transfer, the 
CS-2. like many machines, has a co-processor for 
demultiplexing and D^iA to memory. This DMA usually has 
scatter/gather capability, though typically only with 
constant "stride, and thus falls short of sender-based 
addressing. The sender-based and receiver-based addressing 
modes are combined mutually exclusively. This mutual 
exclusive combination is also true of other new parallel 
machines which use sender-based and receiver-based 
addressing. In the MIT Alewife machine and the Stanford 
FLASH machine, cache blocks for shared memory traffic use 
sender-based addressing, while bulk data transfers and 
send-receive traffic use receiver-based addressing. 
Generally, sender-based addressing is used for random access 
communication and receiver-based addressing is used for 
protected, cross-domain communication. 

Some similar work which combines both sender-based and 
receiver-based addressing is a system which is described in 
••Active Messages: A mechanism for Integrated Communication 
and Computation" in mr • 1 Symposium of Computer Architecture , 
pp. 256-266, May 1992, by T. von Eicken et al . In that 
systerr, the sender attaches to each message an address in the 
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,ecewer of an interrupt handler which is invoKed upon 
message delivery to extract the message from the netvorK and 
deposit or process the message as desired. The message may 
,lso include other arga^ents. The interrupt handlers are 
constrained in length and action and are executed using the 
.est processor in the receiver address space so -that they run 
without the cost of a context switch to a new thread. A 
.ajor problem with this system is that it is restricted, at 
least without hardware support, to single user applicataons . 
because context switching is otherwise required. An 
additional problem with this system is that it also treats 
all messages at the receiver as requiring an interrupt and 
thus does not reduce the impact on the processor. 

By way of further background. William Dally presents in 
. pally. ^-^^ -et a-l . — ^-r-c-hit-ecture—of..^ Message-Driven 
..ocessor", in Tn-.rn.rinnal S.^pn^inm on Computer 
^.h.tecture. 198-7; Da-l-ly. -W...a . .et_ al . . "The Message Driven 
Processor: An Integrated Multicomputer Processing Element". 



Comput< 

Massachusetts, October U-l*. ^-6 Dally, W-J- « >^ ■ ■ 

-Fine-G.ain Concurrent Computing", in Vo^r.rrt. directions .n 
..ie.ce: AT perspective, edited by Albert «eyer, 
„,T press, isn, a n,essage driven processor system, in which 
a non-conventional host processor executes conununication 
actions using built-in communication primitives without 
benefit of a separate networ. interface. The primary 
disadvantage of this system is this lac. of a separate 
network interface which would otherwise permit the use of 
conventional processors. Protection issues, especially for 
user-to-user level communication, for a multi-user system are 
Uso not addressed. Additionally, this system is not 
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well-adapted to ATM networks because the message format is 

""""'ll a multi-user network environment. typified by a 
local-area network (LAN), communication ' is a global resource 
for which protection must be provided to isolate a user on 
one processor from accidental or malicious interference from 
another user on another processor. -Also, if the nodes are 
>ulti-user as well, protection must also be provided to 
i.ilarly isolate users from each other on the same 
processor. There has been very little work on achieving low 
overhead in this type of multi-user network and multi-user 
processor environment. 



mu 
s 



- TO overcom'e th^se pr^^lems" and limitations with the prior 

art. this invention provides a network protocol and interface 

"for low over hTad- communication in a network of multiuser 

comr^uters based on direct deposit messaging. Direct deposit 

messaging signifies directly demultiplexing messages and 

depositing both data and control information directly where 

they are needed, e.g.. data in memory, and control in 

(conditionally) interrupting the host processor. In such a 

svstem, asynchronous events are controlled by separating 

events by their need for the host processor. Events, such as 

data delivery, which do not require the host processor are 

.andled directly, e.g.. by depositing data directly into user 

memory. Events. chiefly synchronization. that require 

^^rh the host processor are divided into 
interaction with tne no:**. f 

ir..ediate actions that require immediate service and 

delayable actions that are accumulated and processed at some 

<r.r the host processor . thereby turning them 
time convenient for tne nosv f*- 

^^r^ w^th this separation of data and 
into sy-nchronous events. w.tn tnis y 
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control events, resources sufHcient to bypass the non-nost 

needed . . . 

TO obt=in effective separation of these events, the 
,,ceiver processes a message usin, information in the message 
,,c. the sender. The receiver action in response to a 
message aepen.s hoth on this .sender information and 

^ ^ receiver. A message therefore 

information stored at the recei 

X ♦•Ko nnpration desired by the 
includes an indication of the operation 

■ ^necified by the sender and' one 
sender, one or more operands specitiea oy 

A ^hirh refers to some state maintained by the 
or more operands which reters tu 

receiver. The receiver ensures that the desired action is 
permitted and then, if the action is permitted, performs the 
action according to the opera^nd specified hy the sender and 
the state maintained by ffiilreceiver . ^ 

The action performed by the -rec-eiver may -be- message 
aeuvery, wherTin an" operand in the message specifies values 
„se in various addressing modes, such as d.rect, 
indirect, post-increment and inde,, modes, I.ata is either 
„-itten or read from such addresses directly, without host 
processor or operating system intervention -hile maintaining 
multiuser protection. The action may also be conditionally 

. interrupt, -herein an operand is used in 

generating an interrupt. 

•.K rhp receiver state to determine whether a 
combination wath the receiver 

^essase requires i^ediate or delayed action. The action may 
,„„olve special memory locations, called address registers, 
contained in the networ. interface. This networU interface 
especially useful for asynchronous transfer mode <.T«, 

Mch messages are compared of fixed si^e 
networks an which messages 

primitive data units called cells. 

■ .he interface design also supports a multi-cell format in 
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Haffl This enhancement 
and subsequent cells contain purely data. 

provides increased bandwidth. 

„ is also possible to have endpoi.ts which overlap to 
.now controlled sharing between endpcints in a flexible 

^«r,^ci-or«; is lin^ited to a 
manner. .Access to address registers 

contiguous bloc, of registers called a window. As with 
endpoints. address register window, .ay be overlapped and 
nested „ith _other _ address register windows to allow. 

^AA-rocKi reaister windows in a 
controlled sharing between address regisTie 

flexible manner. Address register protection may als6 be 
provided to restrict access of the sender to different 

'''Tulerous other enhancements may be made, including 
integrating exception handling and flow control, paging 
connection- and- endpoint. t.ables.- adding global address 
registers, and using hybrid mapping to reduce .translation 
Took-aside buffer -misses . 

:t is also possible to provide higher level operations 
executable at the receiver as part of an instruction memory 
accessible by an instruction pointer found in the operation 
neM of a message. Thus, the sender may be isolated from 
some Knowledge of the receiver. The full spectrum of 
operation of sender-based addressing and receiver-based 

addressing is thus provided. 

With this system, communication overhead is reduced by 
.Uowing the sender to specify as much as possible about the 
intended action for the message, while still allowing the 
receiver to control message reception for protection and 

^^/^T^c Thus, both sender and 
receiver-dependent operations. Thus. 

^eceiver information is used to demultiplex messages directly 
.0 Where they are needed, reducing latency. The processor at 
..e receiver is involved only when synchronization is 
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^eguired. That is. interrupts are eliminated for every 
.essace; an interrupt is generated only when a message 
.eguires irr-.ediate action. Thus, impact on the processor .s 
reduced. This combination of control ci asynchronous events 
rnd -direct deposit messaging provides flexibility and reduced 
overhead with both full protection a^d separation of control 
-.nd data. This network interface and protocol is applicable 
-a-^rTss a wide range of networks and- across a wide range of 
applications in parallel. distributed. and real-time 
computing . 

in summary, a network protocol and interface using direct 
deposit messaging provides low overhead communication in a 
network of multi-user computers. This system uses both 
sender-provided and receive^r-provided information to process 
received "n,eslages-""and " to deposit both data and control 
information directly where they are nee_ded: data in memory 
and control information in conditionally/optionally 
interrupting a host processor. Message processing is 
set^arated into data delivery. which bypasses the host 
processor and operating system, and message actions which may 
0^ may not require host processor interaction. In this 
protocol, a message includes an indication of the operation 
ces.red by the sender, an operand specified by the sender and 
an operand which refers to some information stored at the 
-eceiver. The receiver ensures that rhe desired action .s 
permitted and then, if the action is permitted, performs the 
action according to both the operand specified by the sender 
and the state of the receiver. The action may be message 
deiwery. wherein the operands in the message specify values 
for use in various addressing modes including direct. 

. =.r,^ index modes. The action may 
indirect, post-increrr,ent and anoex 

^ir^n an inter^'UDt. wherein the 
also be conditionally generating an interrupt. 
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operands are used, in c=ni>lnation «th the receiver state, to 
deternine whether a message requires in»ediate or delayed 
action. The action may aUo be an operation on . register in 
the network interface or on other infor,..tion stored at the 
receiver. The network interface^ and protocol are intended 
use with local-area networks. ■ Specializations of this 
interface and protocol are particularly applicable to 
.7y„chronous transfer mode ,AT«) ■-networks. The network 
interface includes endpoints which may be nested ^ and 
overlapped, address r*,isters which may be organized into 
windows which may be nested and overlapped, address register 
Ptctection and integration of exception handling and flow 
control . 

- -The invention will^be further deecribed -by-way of non=l imitative 
e^ipLe with referee to the accompanying drawings . in which : - 

.ig 1 is a block diagram of a typical conventional 
conpute. syste. with a communication system using 

receiver-based addressing; 

Fig. 2 is a flow chart describing a conventional 
corr-municatior. process for the computer system of Fig. 1; 

Pic. 3 is a block diagram of a conventional computer 

^^^r^ cvQtPm uslnQ seoder-based 
system with a communication system using 

addressing; 

Pig. 4 is a flow chart describing the operation of the 

computer system of Fig. 3; 

Fig 5 is a block diagram of a computer system with a 
communication system in accordance with the invention using 

direct deposit messaging; 

Pig. 6 is a flow chart describing the operation of the 

computer system of Fig. 5; 



lO 
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7 is a block aiairan. describing . suitable format 
,or a 53 byte ATH cel. to be .sed in one e»bodi.ent of this 

invention; ^ 
Pi, 8 is a bloc, diagran, describing one e»bod.*ent o. 

-He netuor. interface in the co™..nications system of the 

„.puter syste. of ri^- 5 for use in .T« networks with the 

I 

cell format of 

Pig , is a "blo^k dia3r_..:o£ «ie .receWe side operation 
,o,ic of ri,. 8 Showing protection and addres. generation 

portions; . 
rig. 10 is a blocK diagram of an example 

-he -eceive controller shown in Fig. 6; 

.ig n is a block diagram of a direct cache interface: 
r:g. 12 is a flowchart describing how an endpoint and 
7onn'ection based coo^unication is established; and 

Pic 13 is a block diagram of the receive side operation 

logic of Fig. 8 in a second e^^odiment of this invention. 



The present invention will be more completely understood 
.hrough the following detailed description which should .e 
-ead in conjunction with the attached drawing in which 
"similar reference n^bers indicate similar structures. .11 
.eferences, including publications and patent applications^ 

v,«r-t.hv exr>ressly incorporated 
cited above and following are hereby express y 

by reference. 

«^.<:•cem is shown in Fig. 1. 
A typical computer system is 

.... of illustration, a first computer 
includes, for purposes of ^^^^^.er 
..lied the sender) 50 and a second computer 
(hereinafter called tne 

• S2 It should be 

^>or called the receiver) 52. 
(hereinafter caiiea 

. K ► .he reference to sender and receiver are used 
understood that the reference 

, .or ease of illustration. The sender 50 and the 

merely for case 



II 
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52 .re in.erconn«t.d by e network 6. via respective 
interfaces « anS B6. Although Fi,. 1 shews two 
^."-e-s <i.e., sender 50 and receiver 52), the system is 
„ computers. There ™=y be multiple senders 
: "„\e receiver. .uUiple receivers and one sender or 
■.:..,;e senders and receivers. AUo. each such computer ..y 
comprise interconnected processors. . FinaUy, the sender 
;;,e=eiver „.y be interconnected pVocessors within a single 

Thp term nodG as 
^-puter such as one parallel coir.puter . The 

«r rpreiver. Each node is 
se-- hereir. signifies any sender or receiver 

to have its owr. virtual address space (i.e.. the 
...s a-e ass-u.ed to have virtual n>enory) distinct ana 
Z.^.^^e-' fro. that in oth er nodes. Th e network .s 
ZZ..^^^tty r^um^e-^bn-cooperatinc users an^ each 
,e used by n^ore than user. Thus, protection 
are t.-oicaily provided to protect against 
cr malicious interference with a process of one 
... a-~ther user. For example, there are mechanisms to 
,/e-e-- one user fro. 1) unauthorized sending of messages to 
,,3er. 2) unauthorized access to ne:.ory of another 
"■'JZ trying to appear as another user to a receiver. 

Ijse nechan.s.s are co^only found in standard extensible 
^.etwcrKs interconnecting multiple users and are often not 
^ou-c in closed, proprietary networks. 

" '..e sender 50 includes a processor 5. connected to the 
.e-worK interface 8. and programed according to a desired 
::.atinc syste..- illustrated at 56. The operating system .e 
Z'\ co:.outer progra. which manages node resources such as 
processor time and network access, arbitrates and 

*■ w and controls 

*^^r^c '■-on^ each otnei. . e^*-* 
e--s applications ^^o^ 

:::;.;;ticn between appucaticns. scch as applications 5» and 
::.:.ic . and the processor 5< . The operating syste. 

12. 



.ssocU«a t.erevi-.h ..s»,e buffers .2 and » wMc. are 
„3e- .o sen. »essa,es across the network interface 8. tP the 
„.;ver computer S.. Each appUcaUon SS. eo .as 
therewith respective .e»ory portions .6 
.a. he caale. enapoint buffers or si.ply endpoints . 
" Sin,ilariy. receiver computer S. ' incudes a processor 

r^^^ RA The processor 70 is 
connected to the network interface 66. P 

p,ocra™,ed according to a desired operating syste. as 
L:strated in ri,. i. Similar to the sender, the operat.n, 
.,.te» and receiver includes .essage buffers 74 an 
,1 .ppncations and BO also have associated therewith 
portions and ei, which also .ay be caUed endpoint 

buffers or simply endpoints. 

■ I- general the sender or receiver .ay include other 
.-ocessors such.„as a networ_R .c-processor or other 
"^-processors. To remove an^iguity, processors. 5, and 70 are 
re'erred to as "host processors" herein. 

" . „essa,e <sho» at a3, in a conventional syste. includes 
, ,,,aer Which is used by the network S, and the network 
i..erfaces e,, se to direct the message to the appropriate 
node and endpoint. 

■ ^ application, such as 5S, at the sender .0 and receiver 
c=.-,unicate by sendin, messages to each other across the 
network 8.. The .essage .ay include any data, including a 

^r, Fia 2, conununication 
^ ^« r-All As shovm m Fig- 
procedure can. , ^ 

conventionally involves at least the following steps, r.rst^ 
,ne application 58 invokes a send command in step 9.. 
.„.ple way for an application .8 to invoke a send operation 
,3 bv executing a co™.and as depicted by the "send- co,.an 
\ne co^and includes an an identifier U.) which iS used 
..t to identify the receiver, a source address fro. whic 
:;s.e data will be taken and a si.e. indicating the a.ou,.. 
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of data to be sent. The operating system then copies the 
message in step 100 from application memory, such as an 
encpoint 66, to message buffers, e.g.. 62. in the operating 
systen 56 of sender 50. This step is often optimized by 
dipping locations~in the application memory to locations in 
the message buffer 62 to avoid actual copying. The operating 
system~56 theTpeTf^rms Protocol processing if necessary in 
step 102. That "is. the data" to" be ' sent is placed into the 
proper format as may be required by the network 82. ^ The 
message, or perhaps several messages if the amount of data is 
large, is then injected in step 104 into the network B2 
through network interface 84. 

Usually, message arrival at the receiver causes an 
interrupt to the processor 70, and the operating system 72 
directly extracts the message in step 106 from the network 
(neaninc netwprk and network interface) and copies it into 
:,essage buffer 74 in the operating system. Alternatively, in 
a system with appropriate hardware, the operating system may 
set up a direct memory access (DMA) with the network 
interface to extract and copy the message to message buffer 

74 . 

in some communication schemes an intelligent network 
interface, or perhaps a second processor at the receiver 52 
(e.g., the communication co-processor in the Intel Paragon) 
extracts the message from the network 82 and copies the 
message to the message buffer 74. The operating system 72 
then performs protocol processing, if necessary, in step 108. 
then copies the message in step HO from the message buffer 
72 at the receiver 52 to application memory such as endpoint 
79 usually the operating system 72 copies this data to 
acolication memory in response to an explicit receive request 
by' an application, such as a "receive- command 77. The 
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cc.,..d includes an ID- v.ich is use. in p»r. « specify «e . 

„e=sa,e .offer, e.,., 7,. fr» w.ich data is .o .e received. 

,,e receive address rc .hich the «ss„e snouid be ccp.ed. 

„a a si.e indicating the amount ■ of data. It is also 

.ossih.e for the operating system to auto^aticaUy copy 

Le data according to so.e previously saved state 

information. .s in the sending., ide^. this copying is often 

cp.i.ized hy mapping locations in the applic.tion_n,e.ory to 

Knffpr 74 to avoid actual copying, 
locations in message buffer 74 to , 

. co-unication syste. has overhead vhich includes hoth 
communication latency and impact on the processors 5* and ,0 

.ne sender .0 and receiver 52. For sa«e of simplicity, 
communication latency and communication overhead are referred 

, latency and overhead. Latency may be 

to herein merely as latency^ a**^ 

aef'inea as an amount of time taKen for a message to be 
.-ansfer.ed from application memory at the sender 50, such as 
endpoint se. t; application memory at the receiver 5.. such 
as endpoint 7,. Impact on the processor involves interrupt 
handling. data flow control and protocol processing, 
overhead is reduced by optimising the steps described above 

^rh FiQ 2 so as to reduce latency and in^pact 
in connection with Fig- so 

on the host processor. 

fnr aoDlications which require 
LOW overhead is important for appiicat 

rapid response behavior, such as parallel and distributed 
computing systems and real-time control systems. In paralle 
computing systems, low latency is essential to reduce th 

process at a sender 50 waits for data to be 
amount of time a process „-ucation memory 

-ead from a remote memory location, e.g.. appl 
„, for remote synchronisation operations <e.g^^ 

• « locks) to be completed. m 
obtain^no and releasing locks) 

r. uied computing systems, performance of a client serve 

. lirr^^ed by the amount of time re<^ired .0 do a 
r.cdel is often lim^^ec uy 



IS 
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remote procedure call (RPC), which is affected by latency. 
The importance of low latency is perhaps mosr obvious in 
real-tinie systems where an inordinate delay in corr-T.unicat ing 
a control input may lead to disaster. 

Even when low latency is not essential for a given 
application, it may increase the- spectrum of possible 
applications and the flexibility in structuring a system. In 
parallel computing systems, lower, latency enables the 
efficient exploitation of more finely grained computations 
and thus increased parallelism. In a distributed computing 
system, sufficiently low latency may make paging, e.g. by 
sender 50. over a network 82 to memory in a remote node. e.g. 
receiver 52. faster than paging to a local disk (not show-n) . 
Finally. low 1 atency -could help make ~clTent-server " based 
ccT.puting systems attractive for realizing flexible real-time 
corTiputing systems. — 

Lov impact on the host processor is important to minimize 
the decradation on applications due to reduced and 
unpredictable processor availability. Predictability is 
particularly important for real-time tasks perforned by the 
host processor. It is important to insulate such 
applications from unrelated asynchronous network events. 

Current generation parallel computing systems with 
proprietary networks obtain latencies in the range of Iwsec 
to lOOpsec It is desirable to have latency in a 
local-area network be no more than 1000 cycles, which for 
future iGHz processors, is Ivjsec. In conventional 10Mbps 
Ethernet LW^s. latency is typically about 1msec. First 
generation 100 Mbps ATM networks can achieve about 250-,^sec 
late.-icy using conventional network protocols and interfaces. 
Because increasing the speed of a network does not 
necessarily reduce latency. to achieve lower latencies 
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..,ove.en.s are nee.e. in .he ope.a.in, syste.. .he networK 
interface and the network protocol. 

conventional approach will now .e described in .ore 
,e^3il in order to identify the obstacles to o.tainin, low 

• . conventional conununication systen>s 
overhead communication . _ Conventio 

° . _ 'TCP/IP implementations) 

in both distributed systems (e.g.. TCP/IP 

K^n^s (eq" the Intel Paragon) are 
parallel computing machines (e.g.. 
and paraiiex f systems, 
oriented towards bulK _an_d stream data. In 

, in size include a buffer identifier 

n^essages. often large in size, inci 

1 bu»e. ID .n.o .e^an.Ul posiUons in the ..ent.f.e. 
message buffer, e.g.. 74 in Fig. 1. 

• :n ai«s. ail cases, .he ne.wc. interface generates an 
,-«rrept to the host processor. The operating^ syste. then 
e;ther airect.y copies eata fro. the netvor. interface to a 
,»ssage huffer within the operating syste. or sets up 
transfer which accomplishes this n,essage copy. 

huffer 74 is within the operating 
Because this message buffer 

^ ..-lier the operating system 72 at 
cvs-ea 72 as described ea.lier. 

eceiver .2 must copy the data from the message buffer 74 
„ appucation memory, such as endpoint 7,. one way to 
Uni'n te the copying overhead of such buffering is to ap 
application memory of endpoint n to the message buffer 
Z lernatively, the data could be copied directly to 

^^^^ network interface. 
at)olication memory from the net 
■".egardless of the implementation method, message 
,..„y ohly demultiple«ed to se^ential locations m eit e 

buf'e^s in the operating system or m the 
message buf.e s ^^^^^^^^ ^^^^ ^^^^^^^ 

..vacation - - ^^^^^ ^ ..ed within 

interface): the pos.. ion 

i^ determined implicitly. e.g.. 
a nessage buffer is ciete 



n 



„ce.ve. causes an P ^^^^^^^ 

— " -^"^ """1 1 nn./aa.ess o. a .essa,e. 

" receiver on receiving .he 

the interrupt status o is appropriately 

„essa,e. tKis .or» 0^ .essa,e .an.Un, 

-■-^ '::r:Lentiona: net-or. protocol to 

is co^on v.t. suo- on a sin,le 

,..tiple« — -;:^jr^,L„, i„tro.uces si.i.cant 
processor. However . th.s P ^^^^^^^ „ 

..e:.ea., because .any eye ^^^^^^ ^^^^ 

interrupt. interrupt. FinaUy, an 

cycles are usea to ae«Ui ^^^^^^^^ 
^ w-r^rne- is executed. •i"-^'' 

e'lay in .essa,e aeUvery ana temporarily 
both incurs aelay receiver. The 

postpones application ---^ .use. .y 

„„Pte<iic=able nature oi app ^ ^^^^^^^ 

asynchronous message arriva ^^^^ application 

.yste.s. rre^ent interrupts also 

,,..„t.ance. „,,,,„„p„„rs such as the Intel iPSC 

Maress.n, in ear y ^^^^^ ^^^^^^ 

„„a such an approach, t^ ^ ^^^^^^^ 
receiver ae»ultiplexea .essa, 
application ae»ultiplexea the huH 

..ory. .ernel involvement remains^^st^^^ ^^^^^^^^^ 
parallel workstation area. 

:„«rnational Business Machines. ^^^^ ^^^^ ,„„.entional 

„ alternative solution p^cessor as a 

is to ada another Juii P 
coa,unication protocols ^^^^^ 

=--rrar::::r:s.chronous.essa. arrivals. Such 

whicn IS use-" 
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1. also useful for handling 
3 comunication co-proMSSot is a.so 

complicate. ,ather-scat«r operations, whicn arise when large 
messages are used. since t^is co-processor auplicates 

hardvare, it is expensive. 

instead c£ devoting significant resources and complexity 
the receiver, liKe the co-processor approach, to determine 
„«,e to deposit and ho- to handle messages the sender can 

c..,.>, a. knovm, though less 
do this determination. Such a Known, 

• »i .r,T.roach for a cotmunication system is shown in 
conventional approacn lot « ^-^ , 

.ne .loc. diagram in Fig. 3. In this system, a sender 87 and 
receiver B8 are connected by netuork 82 via net-or«c 
interfaces 69 and 90. The respective operating systems 91 
„a S2 need not have message buffers. In this system, the 

1 1 he deposited in the 
sender specifies where a_message_will_be_ oep 

receiver. As^ndicTted by the example command 9S in Fig. 3. 

-.e -send- command no- also indicates -here the message is to 

deposited in the receiver, i.e. , by including the -receive 

,adress-. Thus, a message such as 9, sent from the sender 8, 

.0 the receiver 88 includes not only header and data 

^Yia rpceiver 88 into 
information but also the address at .he recei 

„nich the message ,4 vill be placed. The sender may also 
.irectly specify Whether an interrupt is to be generated upon 
messace arrival at the receiver. 

^5. , is a flow chart describing the general operation 

, First the sender determines an 

of the system of Fig. 3. First. 

^c: rn be deposited at the 
address where the message is to be P 

receiver. The sender invokes the send command in step 

„i,, ,nis address. The message is then injected into the 

networv in step U. directly from the application memory, in 

contrast to the conventional system shown in Fig. = which 

,,ects a message from an operating system level message 

Kext, the message is then extracted in step 113 ano 



. ..e ne«o.. interface, ^he .ess.^e is ee.uU.pUx a 
" . .he .eceive address in .he .essa^e (step . 

.0 .he convenaonax -e.-c. s^s.e. 
e..i„e.s.e..eve..essa,ehu«e.. 

in this system, copying _ .,,ffers (steps 

. .o and from operation system message buffers 

. Pi. .) is omitted! Kat.er.„a message .s 
100 and no of Fag. 2) ^^^^^ 

•Pd directly from application n,emory m 
copied direct y ^^.errupts to ^ the 

application memory m the recei 

or may not be indicated by the message. The 
processor may or may specification by 

^ is solely dependent on the specm 
interrupt state is solely P 

.hP sender. Various realizations of such a sy 

;:esence, a„.n., an. de.ai.s . .o.ec.ion. as in.ca.e. 

r .o-;.c. a-. in.e.ace has .een .se. 

K =c the Tera 3D 
. parallel machines. such as 

Pr.r-.a-.-> Minnesota, and the 

-va-iable from Cray Research, Inc . . or 

.nive.si.. .achine. Such a pa.aUe. .ach. 

: : . ,:c.a: address space uses a .essa.e, .suaUv s.a n 
,1 vcrd cache h.oO. which carries .o.h an 

,a data The da.a is stored directly in the 
sddress and data. „ethod, 
-.ver address contained in the .essa,.. - ^^^^^^ 

der,ultiple"n9 at the 

specifies an the information. Conse,.aentiy. t 

wdlin, is caned sender-based addressing, 
message handling is aescribe 
.naiogous term -sender-based interrupts 



.ne interrupt „.^ed to ha.e 

Because such a para.l ,„„„„ing benevolently. 

,.,c'e user, or nultiple users inter.cti g 

' issues are often not addressed in their internal. 

:reis ...ic. - ---^-'^ 

netwcr.< systems protection 9T . 

■ . ^--allel processors. The. is. P 
interconnecting po.aiiea y 

• 3 are often omitted. 
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This invention overcomes problems in tne prior art by 
directly depositing messages where they are required. Fig. 5 
is a block diagram, of a corvT.unication systen in accordance 
with the invention. It includes a sender corr.puter 270 and 
receiver cor.puter 272 interconnected by a network 62 via 
network interfaces -274 -and -276- .. The sender and receiver 
conputers 270, 272 each have respective- processors 54 and 70 
programmed according to a desired Operating system 278 and 
280. For the purposes of illustration, the terms endpoint 
and connection will be used. These terms should not be 
construed to limit the invention. as the invention is 
applicable to many types of computer networks. As used 
herein, an er.cpoint signifies a contiguous region of virtual 
_^g_3,y,_._A connection is a virtual channel authorizing 
co-.T,unicction between a pair of endpoints. A connection r.cy 
also be multi-cast, i.e., not restricted to pairs of 
er.cpoints. Applications 58 and 60 being executed on the 
sender 270 and applications 78 and 80 being executed on 
receiver 272 each have one or more endpoints assigned to 
theri. e.g. respectively 66, 68. 79 and Bl. Each of the 
cperctmg systems 278 and 280 store connection state and 
r.appmg information 282 and 284 respectively, which are 
indicative of the states of the sender 270 and receiver 272. 
This information preferably is cached in the network 
interfaces 274 and 276 as indicated at 286 and 288. 

A message 290 sent from the sender 270 to the receiver 
272 includes both control information and data. The control 
information may include an indication of an action to be 
performed, one or more operands indicative of a state of the 
sender and one or rr.ore references to information stored at 
the rece:ver. This information stored at the receiver may 
also be called state maintained by the receiver or receiver 



stcte . 

2.1 



Fig. 6 is . flow chart desczibin, senecally the operation 
syste. shovn in ri,. 5. The sender first seneretes 
seoarate control information and data and constructs a 
„e"ssa9e in step 115. As indicated by the e«.ple connand xol 
in rig. 5. the "send- con»and includes an identifier of a 
cornection. an address from which' data is to be taken, 
control information and ah indication_of the amount of data 
to be sent. 

This message is injected directly fro. application mernory 
into the network in step 112. Next, the message is extracted 
fror. the network into the network interface in step 113. In 
step 116, the network interface demultiplexes the message, 
depositing data directly into memory and/or conditionally 
delivering interrupts to the processor. 

A message originating at a source endpoint bypasses the 
operating system and host processor at the sender. At the 
receiver, the operand may be compared to connection state 
information to determine whether an interrupt should 
conditionally be delivered to the processor 70 and the 
receiver 272. Also, these operands may be combined with 
connection, state and mapping information to determine the 
address in memory in 79. for exam.ple. to deposit the message 



data . 



Thus, in this invention, copying of message data from 

operating system message buffers to application memory is 

omitted. Furthermore, interrupts are conditionally generated 

according to both sender state and receiver state. Also, an 

J, 4-, /ipnosited is determined in 
address into which message data as depositeo 

part by sender state and in part by receiver state. Thus, 
.he sender may be isolated from too much knowledge of the 



receive: 
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This system is generally based on the observation that 
message sending is simple, whereas message reception is 
complex because of the asynchronous nature of message 
receiving. Message handling at a receiver should therefore 
be separated into- message delivery and message action. This 
separation allows control of asynchronous events, wherein 
events are, distinguished^ by their need for the host processor 
at the receiver.- Events, -such-as .me'ssage ^delivery, which do 
not require the host processor are handled directly. Other 
events, chiefly synchronization, that require interaction 
with the host processor are further divided into actions that 
require immediate service and actions that can be delayed and 
acca'nuiated and processed at some time convenient for the 
host processor, thereby turning them into synchronous 
events. With this separation of data""and control events, 
resources sufficient to bypass the non-host processor events, 
rather than all events, is all that is required. 

Message delivery simply involves depositing a message in 
a desired location in the memory of the receiver, e.g., by a 
remote write or a direct memory access (DMA). Message action 
is taking some action in response to reception of a .message, 
such as returning a value, performing a read operation, 
notifying a task that the data has arrived, enabling a task 
on the scheduler queue, which is a data structure indicating 
tasks eligible to be executed by the processor, or invoking 
an arbitrary interrupt handler, e.g., a remote procedure call 
(RPC). 

Whether an action is immediate or delayed depends on 1) 
when a remote process awaiting the result of the action (if 
any), hereafter called the waiting task (not always the 
sender), needs the response, and 2) the priority of the 
action relative to the priority of other activities at the 
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receiver. Some examples of immediate actions are a reaa or 
synchronization operation where the waiting task needs the 
result to proceed, and a high priority control operation, 
such as some operating system action. Some examples of 
common delayed, actions are notifying a task that data has 
arrived and enabling a task on a, scheduler queue. The 
related message may also be referred to as not requiring 

irrcnediate action. 

In many cases, a system may be structured so that an 
immediate response is not necessary. For example, a retnote 
node may execute another task while it awaits a response from 
a message action. Of course, if a message is destined for a 
waiting task which is not currently active anyway, e.g.. 
notifying an inactive process that data has arrived, any 
action in response to that message may- be- delayed . When an 
action may be delayed, the message may be queued for 
processing at a later, more convenient time for the 
receiver. Thus, a message for which an action may be delayed 
becomes a synchronous event. Conveniently, queuing a message 
is merely message sending and a queue pointer update. 

To implement this kind of protocol, herein called direct 
deposit messaging, a message contains the connection, which 
is an identifier which implicitly identifies the receiver 
endpoint, control information indicating both an action to be 
performed and one or more operands, and data. Each operand 
„,ay be an address to be used by the receiver, and/or may be a 
parameter used to determine whether the specified action must 
be performed immediately or may be delayed, and/or may name 
some receiver state. Addresses are encoded as an offset from 
the base of the endpoint. The offser is essentially a 
network logical address that insulates the sender and 
receiver from the addressing details, e.g. address space 
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size, virtual to physical mappings, and page size, at the 
other. This separation promotes modularity and accommodates 
node heterogenity. Furthermore, an offset typically does not 
need the full dynamic range of a virtual or physical address 
and" thus can be~enc"oded in fewer bits within a message. 

A set of primitive actions', representing cotrjnon 
oper-at ions -that may be implemented simply without host 
"Accessor or "operatinF"' systeirTntervention . is provided. 
More complex actions are left to the host processor. The 
primitive actions described herein are simple data transfer, 
i.e., read and write to endpoint locations and conditional 
interrupts to the host processor for delayable or immediate 
actions. 

The simplest operations are pure sender-based direct read 
and write data transfers. For a direct write, the sender 
specifies the source data by its offset from the base of the 
source endpoint and the receiver location by the offset from 
the base of the receiver endpoint. Messages contain the 
receiver offset and the data. For reads, the source sends a 
message with a direct write request to the receiver, along 
with the offset in the receiver and the deposit offset (for 
the reply) in the source, and an indication of a reply 
connection if the connection is not duplex. 

To enable actions vhich are a function of both sender and 
receiver state, the receiver end of each connection maintains 
some state. i.e., stores some information which message 
operands may name and so obtain receiver addresses. To 
simplify matters. this state is contained in specially 
addressable locations which herein are called "address 
registers". This state could also be held in general memory 
locations. Thus, message actions are a function of an 
operation specified by the sender, operands representing 
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sender sta 
address registers 



„. .ne receiver state, such as the contents oi the 



.he f.nowin, is an example set of primitive operations 
using sender and receiver state. 

t 

I 

1 Direct addressing: effaddr - operand 

2 " indirect addressing: effaddr = <'adcareg,> 

3". indexed addressing: effaddr = «addreg,> . operand> 

^eister npprations: 

1 . addreg^ * operand 

2. addreg. ^ unary-op oddreg^) 

, addreg.^<addreg.> bina^-op^addregj> _ 

where i and j are not necessarily different 

rnnditior '-' npprations: 

1. if (<addregi> compare-op operand) 
then generate interrupt at end 

2. if (<addreg.> compare-op <addregj>) 
then generate interrupt at end 

vnere\ and j are not necessarily different 

3..e form o. address generation unit calculates an effective 
^^^^^^^ .hich to read or .rite data. <X 

denotes the contents of .emory location X. "Operand may .e 

^ in a message. The message operation 
data or other operand m a message 

,,.„.,«s selected, and their order. The cond.t.onai e 
!.a. occur at an. t..e .ut the interrupt pre.erah:. occu.s 
the end of the compound operation. 
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These primitive operations allow a rich set of powerful 
and flexible compound operations. For example, an indirect 
write with postincrement can be ' synthesized with an 
indirection followed by a register operation: 

<addreg^> WSG 

addreg^ <addreg^> + operand 

The last step may also be: "add?^gT^addreg7-»-<addregj>^ 

Done on a per-cell basis. this compound operation is 
equivalent to DMA with stride equal to the increment value. 
However, note that varying operand or <addreg.> yields 

variable strides. 

AS another example, priority gueueing and interrupts can 

be syTrtheriTe"d--as^fo-l-l-ows-: — - - 

<acdre5; > -^-KSG 

addregp ^ oddreg^) + <addregg> 

if (operand greater than <addreg.>) generate 

interrupt at end 
adcreg^ *- <addregi> bitwise-or operand 

where -operand" indicates the priority of the message, 
addregp points to the end of the queue to which a message 
of this priority should be added, addreg^ contains the size 
of MSG, and addreg^ holds the priority level at the 
receiver where p. i and s are different from each other. The 
nessage specifies the operand and register indices p. i. and 

Co..ole:.: compound operations like this priority cueueing 
.ay recaire multiple compound operations. For example, two 
compound operation messages would be required in this case .f 
the receiver executes one register operation per message. 

2.1 



As e final example, someciraes it may be convenient to 
c=?end -essaces to one of several different queues without 
generating interrupts and maintain a bit vector of non-empty 
queues. T.-.:s mechanism can be implemented in generally the 
sa:r.e way as priority-based interrupts, but with the most 
significant bit of addreg^ set to 'block interrup'ts . This 
implementation assumes both unsigned comparison and fewer 
queues than bits in addreg^. This ' second assumption could 
be relaxed by using multiple address registers. Messages^ may 
also be appended to a. queue within a specialized endpoint 
within the operating system. enabling delayed actions 
involving the operating system. 

As ot.her examples, various atomic operations such 
fetch-end-increment ^rerd-modify-write, and compare-and-svap 
can be implemented by devoting one or more of the address 
registers for the- target location and using the register 
creraticns for incrementing and comparing. Barrier 
s-jT.c.'-.rcnization can also be implemented this way. When a 
process reaches a barrier point, it toggles a bit in a 
specified address register of all processes in the "barrier 
set" and then waits for a conditional interrupt when all the 
tits are set (or cleared). 

This protocol provides at least the following benefits. 
First, the combination of sender state, receiver state, and 
operations on the two is very powerful. The full superset of 
capabilities of both conventional, receiver-based addressing, 
and sender-based addressing is possible. The mix of sender 
ar.d receiver information can be varied on a per message basis 
tc acco-T.odate different requi reme.it s on what the sender 
knows. or alternatively, different requirements on the 
isolation of knowledge between sender a.nd receiver. 
Indirection provides protection by isolating the sender from 
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too much knowledge of the receiver because the actual storage 
address is partially a function of an address the sender may 
not know. Similarly, the interrupt status is partially a 
function of receiver information contributed (via messages) 
by processes that the sender may hot know exist. As will be 
described below, a mechanism may also be provided for 
register protection that prevents a-sender from accessing by 
read or write or otherwise determining or modifying the 
contents of specified receiver state. such as address 
registers . 

Second, predictability of computation, which is important 
in real-time systems, is increased by restricting control 
flow interrupts to well-defined points. In effect, the 
asynchronism and nondeterminism is eliminated from 
asynchronous network events. 

Third, action handling is more efficient and results in 
less overhead because interrupt overhead is amortized over 
multiple actions. Polling can be used to synthesize hybrid 
inrerrupt-polling action methods. 

Finally, more complex operations can be formed by 
cc-bining the result of multiple compound message operations. 

This protocol is preferably endpoint and connection 
based. Endpoints and connections are allocated and 
deallocated with kernel calls. Preferably, endpoints are 
page-aligned. Thus, host virtual memory page protection is 
also used within e.ndpoints. A connection may be established 
between any pair of endpoints. including endpoints on the 
same node. The connection establishment protocol is much 
like session establishment in Berkeley UNIX sockets. Some 
out of band mechanism, such as a boot-time agreed upon kernel 
endpoint and connection, is used to arrange allocation of the 
er.docint and connection in the receiver. 
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An endpoint may have multiple originating connecnonS 
anc/or irultiple terminating connections. Connections can be 
simplex, duplex, or multicast. Connections originating from 
or terminating on an endpoint all .share the same mapping 
information. However endpoints ckn be overlapped or nested 
to form more complicated protection ' patterns . For example, 
connection A could create an endpoint with virtual address 
bounds (vj^, v^). Then connection B could create a 
second endpoint that is a proper subset of this range to 
allow connection A access to all of B. but B to only access a 
portion of A. Or. connection B could create an endpoint that 
partially overlaps with A's (v^. v^) range to allow 
connection A and B a limited range in which to share without 
exposing their entire respective endpoints to the other. 
Different protect i on_schemesv- can also be -realized by mapping 
the physical pages of an endpoint to virtual address ranges 
with different page protection. 

Network protection is provided as follows. Access to the 
network via out-going connections is controlled by 
per-ccnnection state maintained by the kernel. Messages 
arriving from the network check for authorization with the 
receiver connection state maintained by the receiver kernel. 
Authorization to receive a message from an incoming 
connection implicitly authorizes the message to write in the 
associated endpoint. However, the receiver address must 
still map to a legitimate endpoint address and the operation 
must be permitted by per-page access rights. 

Protection can also be provided in such a system by 
having the receiver verify that the operation requested and 
the address used at the receiver are legitimate. For 
exar.?le, a memory region can be specified for each 
application. If the address to which data is written is not 
in tr.e specified region, access is denied. 

30 



An implementation of this system, specialized to Atk 
networks, will now be described. It should be understood 
that the following is just an example, and that the invention 
nay be implemented for other networks and in different ways. 

Fic. 7 shows one possible format of a 53 byte ATM cell 
120 for this implementation. In the simplest format, a cell 
includes data and control information along with the standard 
network header and other information, The sizes of the data 
fields in this format are merely exemplary and are not 
intended to be limiting. 

In Fig. 7, an ATM header field 132 contains link routing, 
which indirectly identifies the receiver, and traffic control 
information. This field is five bytes and is in a format 
suitable for processing by a standard ATM switch. The 
connection numbe_r^ is — encoded in a virtual channel/virtual 
path identifier (VCI/VPI) field (not shown) in this header, 

There are also a field of 32:. bytes of data 122 and 16 
bytes of control data (discussed below) per cell 120. The 
data field size matches memory and cache block sizes of 32 
bytes and thus enables fast, efficient hardware 
implementations. Data masks can be used to eliminate 
unwanted message data at the receiver, as explained shortly. 

A cyclic redundancy check (CRC) field 124 of two bytes 
may also be provided, to correspond to the data field 122, at 
the end of the cell 120, to help prevent an errant message 
from being interpreted as a valid message. Two other bytes 
are unused. 

The control information includes a four-byte operation 
field 130 which specifies the type of operation to be 
performed at the receiver. This operation field of 130 may 
include a mask field 131 and opcode field 129. The opcode 
field specifies the operation, whereas the m.ask field can be 
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used to deselect the reading or writing of four byte words 
within a block of the data field 122. That is, bit i in the 
mask controls whether data word i is read or written. This 
feature is useful to update a location without changing the 
values (e.g., variables) in neighboring locations in a 
block. A four-byte operand field 126 is also provided. The 
operand is a 32 bit immediate sodrce operand (offset or 
data). Destination operands are specified via three separate 
register indices encoded in a four-byte index field 128. 
These separate index fields are shown at 121, 123 and 125. 
The check field 127 contains a simple check sum over the 
prior control fields so that decoding of the control can 
begin without waiting for the entire cell to arrive. 

For a read request to a remote node, the data field 122 
contains the control field for the reply write message. 
There is also a multiple cell message format for block 
transfer. In this format, the first cell is a "control" cell 
in the write format shown in Fig. 7, and the following cells 
are standard AAL5 (ATM Adaption Layer 5, an ATM signaling 
standard) cells. To avoid complexity with cell boundaries 
and length and CRC in the last AAL5 cell, all block transfers 
are multiples of 16 bytes. 

As mentioned above, the direct deposit model of 
convTiunication herein described is endpoint and 
connection-based: an application allocates endpoints, sets up 
connections between the endpoints, and then sends messages 
over these connections. To support such operation with the 
direct deposit model of communication herein described, the 
operating system at each node maintains the following data 
structures. However, a hardware implementation may cache 
sorr.e cr all of these data structures to support high speed 
operation, as will be described in more detail below in 
connection with Fig. 9. 

32- 



endpoint t.ble includes .n entry for e.ch endpolnr .r 
.... node <e.,., sender 50). indexed by ,n endpoin. n^er. 
..ch entry in this tsble contains indications of a base 
„e.ory address for the endpoint. e.,. in ™e.ory or ei at 
receiver ... endpoint size, virtual to pbysic^ -pp.n, 
i„;or.ation access information, e.g.', private, read only or 
..Sl^area, an open connections to ■the endpoint, and any 
processes attached to the endpoint. 

. connection table includes an entry for each connection 
originatihS or terminating at that node, indexed by the 
connection nu^er. Each entry in this table contains an 
indication of an endpoint nu^er, address register base and 
..unds, connection state information, and reply connection 
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. node address table is also used. Each entry for a node 
;pc':u^es-the na-mT of "a" remote node and the connection nu^er 
a connection (direct or indirect, to the operating system 
the remote node. The index to the table is a uniT- 
clobal identifier for each node. This table is used to 
contact remote nodes for connection set up. This table may 
contain naming information via some alternative 
signalling n«chanism, such as Internet-protocol (IP) 
addresses . 

B,ta are delivered to endpoints and interrupts are 
aelivered to the operating system. Specifically, data are 

^ ^y.^ rprpiver of a specified 
delivered to the endpoint at the receiver 

• n .nd not to processes which may be attached to the 
connection and not to 

. ints inte-rupts are delivered to the operating sys.em 
endpoints. inte.rut^L 

rhe receiver and not to specified processes. The 
running on the receivei -^.^^ 

mav then deliver interrupts to specified 
operating system may tnen 

processes . 
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V . evcrpn will be described an 
• ;.^r^ coruo £ov such a system win 
operation setup toi. 

r trAn "15 The sender 
t^^rh the flowchart of Fig. 12. 
connection with tne ij-^ 

3ncc»es an en.point in vU.ual a.a.ess space .n 

3;;p .00, . .ree s>o. U fcun. or .a.e in the endpoln. taMe 

is fine. With the .ase ad.ress of the enapoinc, ana 

virtual to physical mpping information. 

an alternate connection, perhaps a dea^ca.ed 

vsr>n or an alternate network like a 
operating system connection or an a . 

r control protocol/internet protocol (TCP/IP 
transport control . ' ^ ^ 

. step 201 the sender contacts the intended 

connection, in step -^^^ 

receiver ana requests a connection be setup w.th 
appropriate enapoint buHer size. The receiver then 

-^^ irs virtual address 

,,,„cates that size buffer region in its 

.pice finas or »a.es a free slot in the enapoint table, ana 
-.T-i i,-,,-„ith-the-buf-fer .base . aaare_ss^ana virtual to 

"Thp" receiver then acknowledges 

physical mapping information. The recei 

- - 3 multicast connection, this 

the connection in step 203. 

for each sender-receiver pair in the 
procedure is repeated for eacn 

■ ■ ^ offset from the base or 

nuiticast. Messages containing an offset 

u ^ent over the connection in step 204 

*he endpoint may then be sent over 

^ can be implemented without special-purpose 
This protocol can oe i'i'f-' 

.or ^roa-am on a commercially available 
hardware using a computer prog- am on 

I, interface. such an implementation has 
computer and network interface. 

using two DECStation 5000/240 workstations, 
been maoe using twu 

niaital Equipment Corporation. Each 
available from Digital f . ^ ^ 

h.d a Fore Systems TCA-100 ATM network interface 
workstation had a Fore y 

. ■ ro its ••TUKBOchannel" I/O bus and the two 
nluqqed into its j^^^ 

tations were connectea b.c. to baC, i.e., -ithout an 

1 The DHCStation .000/»OS haa a «1PS .3000 
ATM switch. Ane tv-v.- 

..Kbyte airect mappea off-chip instruction ana 

p.ccessor. "^^^ ,He .ain .e.ory 

.3ta caches, ana ^^^^ ^^.^^^ .u^eochannel , 
and I/O subsystems, inclua.ng 
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coerated at 2^K-iz. The Fore TCA-lOO was a very sample 
interface containing two FIFO queues, one for transmit and 
for receive, and some control and status registers. An 
cell was transmitted by the processor writing fourteen 
32-bit words representing the 5 bytes of ATM header. 48 bytes 
of AT« payload. and 3 bytes of padding over the TUHBOchannel 
rhe TCA-lOO interface. An ATM cell was received by the 
processor reading fourteen 32-bit .. words . Although the 
TUl^BOchannel supported DMA. the TCA-lOO did not use it. ^ The 
TCA-lOO either generated" a TURBOchannel interrupt when a cell 
arrived or a receive cell counter on the TCA-lOO was polled 

.,11.^ >,ave arrived. The data rate over 
to determine if any cells have arrivea. 

the fiber connecting the TCA-lOOs was KOMbps. The 
DECStations ran the Carnegie-Mellon University icm) 
r.crokernel-based .operating system Mach 3. 0,_(MK83. UX41). for 
Which full source code is readily available, including full 
source code, froni CMU of Pittsburgh. Pennsylvania 

s-.rr.ple remote write function was implemented on this 
syste. for experimentation purposes. In this implementation, 
a 32 byte block of data at a given offset from a endpoint in 
the sender was delivered to a sender supplied offset in a 
sender named endpoint in the receiver. The data, offset, and 
bu'fer information were packed into a single cell. The data 
block size was 32 bytes, for agreement with cache block 

rr^orc: tn be 32 byte block aligned 
sizes, restricting the offsets to De ^y^ 

for implementation ease. 

' The low level Mach kernel exception handling code was 

J - via an illegal instruction 

also modified to send a cell via an mey 

..ao on the receiving side, the Mach microkernel was 
modified to partly optimize the interrupt path. In 
particular, the TCA-lOO handler was called directly from the 
kernel interrupt trap handler. 
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Using the inplemencacion descrioec above, the following 
experiment «as performed. From user level a previously 
stored bloclc of data was sent from the source workstation to 
-he receiver. A user level process at the receiver was 
runr.ing. testing the endpoint area for the arrival of the 
data- Once-the data arrived, the 'receiver process sent it 
back to the source workstation where another user level 
process was running, testing for the arrival of the data. 
The total round trip time was then measured. This includes 
two one way remote writes, loop overhead, since the round 
trip was repeated twenty times, and measurement overhead. 
T,^e results are listed in Table 1, below, 

First iteration Average best 

Average send overhead 25 psec 3.4 ysec 

Average send to receive > 1msec, . 49.1 wsec 

Table 1: Remote write latency 

The first row is the average time to send a cell via the 
illegal instruction trap send. The second row is the round 
trio send-to-receive time corrected for loop and measurer-.ent 
overheads. The first iteration column lists the time on the 
first iteration, and the average best cola-nn lists the 
average of the times on 19 following iterations. After the 
first two iterations typically there was very little 
variation in the times. 

The first iteration incurs cache and 

translation-lookaside-buffer (TLB) misses which causes the 
send-to-receive time of this iteration to be much greater 
than that of subseqfuent iterations. (However, it is not 
clear why the send to receive time is so large on the first 
Iteration. The first iteration send overhead is much more 
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reasonable). After all cache and TLB misses and other 
transients have dissipated, the round trip time was 49.1 
psec, which means that the best case one-way 
send-to-receive tiri? for a remote' write was about 24.5 
psec, which is about 80 times faster than a similar test 
using Fore System's AAL3/4 implementation on the same 
hardware running Ultrix 4.3. 

The bandwidth was measured by .-sending a sequence of 
consecutive remote writes as found below in Table 2. 

Block size Bandwidth 

10 19 Mbps 

2 0 22 Mbps 

4 0 24 Mbps 

Table 2: Bandwidth achieved with remote write blocks 



The block size is the number of consecutive remote writes. 

The bandwidth increa'ses^'as ' the block size increases since 
the interrupt overhead is amortized over more data. The 
TCA-lOO interrupt handler reads cells until the receive FIFO 
e*T.?ties. Thus if cells are sent sufficiently close together, 
only one interrupt, and hence one path through the interrupt 
trap handler, may be required to read all the cells. Since 
sending a cell was fast with the illegal instruction trap 
method, this amortization effect is easily obtained, even 
when consecutive blocks are sent using an ordinary "for" 
loop. The asymptotic bandwidth achieved is thus a measure of 
the per cell overhead in processing a cell from the receive 
FIFO. 

The 24Mbps bandwidth is based on 32 data bytes per cell. 
Using the same hardware, others have reported a peak 
ba.ndwidth of about 48Mb?s using the 44 data bytes per cell 
AAL3/; format. This cell rate translates into a peak 
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bencvidth of about 35Mbps using 32 data bytes per cell. Thus 
the partly optimized single cell remote write implementation 
described above is obtaining about 68% of the practical peak 
bandwidth that can be obtained with 32 data bytes per cell. 
Even so. the 2<Mbps bandwidth is significantly better than 
the I4m)ps peak bandwidth measured for Fore System's AAL3/4 
implementation with the same hardware' running Ultrix 4.3. 

The _24,sec latency reported above breaks into the 
components shown in Table 3. ^ 

TUK30channel time psec 
ATM cell time 3 vsec 

CPU and memory time 14 »Jsec 

Table 3: Latency breakdown 

Since the TCA-100 ATM interface does not have DMA, all 
accesses to the interface to send and receive cells use 
programmed I/O. Programmed 1/0 is slow on the TURBOchannel . 
resulting in nearly one third of the latency. The ATM cell 
time refers to the time for 53 bytes (one cell) to completely 
collect at the receiver at the 140Mbps data transfer rate 
used by the fiber connection. The remainder of the 24 ,sec 
(14 psec) is. of course, the CPU and memory time at the 
sender and receiver, and is 68% of the total latency, 
consequently, optimizing the interrupt handler code by 20% 
vill only reduce the total latency by about three psec 
Thus, although there may be room for optimization in the 
current implementation, only diminishing returns would be 
Obtained. Thus, it is probably fair to conclude that 
approximately 20 .sec is the minimum latency for remote 
write with this common commercially-available hardware. 
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in view of these experimental results, special-purpose 

hardware should be added for protection and for direct 

depositing of data in memory to reduce the load on the 

processor, and to reduce memory and 1/0 bus time. This 

hardware support should reduce the " end to end latency to 

close to single cell time: 2.7psec.at a data rate of 155 

nbps. and .es.sec at a data rate of 622 ^^ps . The reduced 

load on the processor could also be '.important for a server. 

such as file server, in a distributed system that has a heavy 

^<«r, ir>art Also variation in latency is 'also 
communication loao. aiso, 

reduced and even the worse case latency can be good. 

such an implementation in hardware of an architecture of 
a network interface for supporting communication in 
accordance with this invention will now be described. 
Because a node typically may act .both as a receiver and 
sender, a network interface for a node should handle both 
sets of functions. The functionality of the receiver 52 will 
now be described and includes address mapping and protection, 
address registers, control for data and address paths, and 
flow control. The sender will be described below. 

The implementation has two parts. The first part is a 
front end architecture for address mapping. address 
registers, and control mechanism functionality. This part 
has multiple possible embodiments, each having generally the 
same functionality. A general block diagram is provided in 
Fig. 8, of which multiple embodiments are described below. 
The second part, the back end which connects the network 
interface to the host memory, also has multiple possible 
e^odiments. Three of these are also described below: a 
direct connection to the main memory, a traditional I/O bus 
connect, or a direct connection to the secondary cache. 



The front end architecture will now be described in 
connection with Fig. »■ The front end o£ the network 
interface is shown generally at 210. The front end connects 
„ the host .e.ory 212 via a bus 2U. The connection of 2H 
to host memory 212 is called the back end which will be 
aiscussed in n»re detail below. Trie front end 210, on the 
receive side, includes a receive buffer me,.ory with flow 
control 216. This receive buffer is preferably a £irst-in. 
first-out (FIFO) memory element. A header splitting ^ and 
.heclcin, unit 21B processes incomng cells and demultiplexes 
the information to VCI/VPI mapping unit 220, control decoder 
222 and data splitting and checking unit 224. The data 
splitting and checking unit passes blocks of data to block 
transfer unit 226, which can transfer data to the host memory 
across the back end 2H. The VCI/VPI mapping unit 220 
aete=r.ines . a connection number and applies it to operation 

1 tTiii hp described in more 

logic 230. The operation logic will be 

detail below in connection with Fig. 9. 

" control decoder 222 decodes the control portion of the 
incoming cell and determines an index and operand which are 
provided to operation logic 230. The index and operand are 
l^e sa:r.e values indicated in the ATH cell of Fig. 7. The 
control decoder 222 also outputs the opcode found in the AT« 
cell to a receive controller 226 which provides control 
information to the operation logic 230. The operation logic 

0. .t»uts, an address, state information, a condition code, a 
reply connection nu^er and interrupts. The address and 
.„.e-ruots go to the back end 21< whereas a reply connection 
-u.-*e- goes to the VCI/VPI mapping unit 232 on the send side 

1. the front end 210. Also on the send side of the front end 
,,0 is send controller 234 whic. receives information fro. 
-he host processor over bus 214 to provide control 
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infcroaticn ro the operation logic 23. and a connection 
purser to the VCI/VPI »appin9 -i- "2. The send controller 
includes send registers 23« from which opcode, operand 
index information are provided to a control encoder 238 
„.ich for.s this information into^the control portion of the 
MM cell described above in connectidn uith Fig. 7. 

, bloc, transfer, unit 2.0 on the ■ send side processes data 
the host memory into a data buf.fer_2.2. A cell forming 
24. . takes the connection information. control 

.r.A data and forms a cell with appropriate 

information, and data ano 

reader information which is then applied to a flow control 
and output buffer units 24S. The output buffer is preferably 
, riFO memory element. The message data n,ay alternatively . 
originate from data registers contained withii^ the send 

register 236^ 

;,side from the receive and send controllers 228 and 23. 
and -.he operation logic 230. the remaining functional blocks 
,.e standard for network interfaces or are relatively simple 
,„ function. Thus. detailed description of these >s 
oPitted. The operational logic 230 and receive and send 
controllers 228 and 23.. with a nun^er of e^odiments, will 
now be described. 

I r,ir~ 9^0 of Fiq. B will now be described 
The operation logic 230 or rig. 

in more detail in connection with Fig. 9. For ease of 
description and illustration, latches and control signals 
between elements in the figure have been omitted. The 
signals on the left-hand side of Fig. . come from the data 
nelds in the «« cell as described above in connection with 
For example, a connection n^er <conn., is derived 
uln. the VCI/VPI in the ATM cell header. 

The operation logic 230 includes a connection table 
.hich caches all or part of the connection table, described 



previously, which is stored by the operating syste,n. 
Accordingly, for each connection, this connection table UO 
contains an entry 1*2. which includes an endpoint nur^er X«. 
address register information, including a base U8 and a 
hounds limit 1.9, and connection ^ state information 150 and 
reply connection 151. The use of these fields will be 
described in more detail below. 

Anendpoint table 160 is" also provided which caches all 
or part of the endpoint table, described previously, which is 
maintained by the operating system. Accordingly, for each 
endpoint. this endpoint table 160 contains an entry 162. 
indexed by endpoint nun>ber. which includes an indication of 
the base 164 of the endpoint. its bounds 166 and address 
napping information 16B. The address mapping information 
refers to a page map structure for that endpoint which is 
stored in host memory. This endpoint information is in a 
separate table so that multiple connections to the same 
endpoint can share the same information. 

Address registers 170 are also provided. There are many 
options for implementing the address registers. Address 
registers are preferably private to each connection to limit 
P-Oblems with managing a possibly limited resource between 
Competing activities. Any address within an endpoint could 
serve as a register for address indirection. Unfortunately, 
such generality causes access speed and protection problems. 
The speed problem is that a message with indirect addressing 
requires a first host memory access in the critical path to 
determine the location to store a message and a second host 
.emorv access with post-increment mode to store the updated 
index. The protection problem is that the indirection 
location can contain any address. The protection problem can 
.e solved by translating this address, but then two address 
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„ansu.i=ns are »^ire. In .he critical path - =ne 

the aaaress cf in.i.ecUcn locavcn an. ano.h 
.„n.:ate .he a.a.ess in the inairecticn location. rcr 
,.ese reasons indirection is preferably restricted to a 
ae.icatea .a«ress registers per connection located 
r interface .e.cr.. This arrangement introduces the 

r..r.c sTjace but avoids host memory 
awkwardness of a_separate name space, t,ut 

Uses in - the critical path '.or direction. .ach 
connection has a n>^er of contiguous locations to for. a 
;.3ter -.inao... .or convenience and flexihil.tv. each 
connection is allowed to a.na.icall, allocate the window s..e 
3. connection set up ti.e. The base and bounds fxelds I S 
,„a in the connection table entry 1« Point to the 

,e,innin9 and end of this window respectively. This scheme 
;uows the overlapping and nestin, of register windows to 
effect different sharing and protection. 

one proble. witn usin, address registers -is that the 

■ k restrict the access a sender has to 

receiver may wish to restrict 

certain address registers. Tor example, the receiver might 
„ct Wish to allow the sender to have the direct ability to 
increment a pointer to a ^eue. For these purposes, the 
,,,,„s registers also include protection bits l» which 

«f access are protection 
rhese registers. Three types of access 

, .rite and indirect. Indirect access 
orovided for: read, write. 

. use this information in an operation 

allows a sender to use rni:. 

hout allowing the sender to determine the actual value^ 
Z example, an indirect address operation with postincrement 
aa the register ana increment it by a specified amount^ 
. not have permission to directly read 
even if the senaer aoes not have p 

c. write the register. - exception occurs if the e 
specifies an operation involving an address register 
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in a way not permitted by the access protection. The hosr 
processor can thereupon choose to access the register 
directly. 

Although address register protection allows a receiver to 
restrict sender access, the sender; _stil 1 names all of the 
operands in an operation. This lack' of isolation can lead to 
other types of protection problems-j For example, a sender 
could still give inconsistent operands,.. e,g,.. an operand 
priority and priority queues that do not match, or specify 
the wrong registers. To solve this problem, operand names 
are isolated from the sender by being accessible only to the 
receiver. 

To provide such isolation, the receiver could decode the 
operation to find the receiver operands or simply interrupt 
^e host "pressor (which incurs significant overhead). To 
allow sufficient flexibility in operand specification, this 
first alternative requires programmable control • at the 
receiver which increases complexity. Nevertheless, a fairly 
siiT^ple implementation of this alternative is presented 
later. To support the very common case of indirection with 
postincrement, the following special case is added to the 
previous primitives: 

effaddr= <addreg^>; 

addreg^ * <addreg^> + <addreg^^j>. 

TO use this special indirect addressing mode, the sender 
only specifies an index i in the operand. The receiver uses 
address register i for indirection and automatically uses the 
next register for the postincrement amount. 

The connection table 140. endpoint table 160. and address 
registers 170 may be allocated from memory in the network 
interface which typically is either a static random access 
memory (SRA.M.) or a dynamic random access memory (DR.^M) . 
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A conventional translation look-aside buffer (TLB) 180 is 
used to map. a.n a.ddress indicated by the message into a 
physical address (PA). The TLB is preferably a fully 
associative cache. The TLB matches;.on bits identifying the 
endpoint as well as on the virtual address (VA) since 
iultiple endpoints may have the same' VA. The TLB also stores 
an indication of the access rights to' the physical address. 

The connection table 140. endpoint table 160, address 
registers 170 and TLb'"i8o' are interconnected in the following 
manner. The endpoint number 146 output from the connection 
table 142 is used as the input to the endpoint table 160. 
The base 148 and bounds information 149 from the connection 
table are 140 fed respectively to an adder 156 and a 
con^.parator 158. The adder also receives the index from the 
received AT.M cell (128 in Fi,g. D and its output_ is also fed 
zo the comparator 158. The comparator acts as a filter 
through which a valid address to the address registers 170 is 
provided, otherwise an error trap occurs. 

The endpoint table 160 has its outputs for the base 164 
and bounds 166 connected respectively to an adder 172 and 
comparator 174. The adder 172 also receives an offset from 
the received ATM cell (in operand field 126 in Fig. 7). A 
multipie.xer 176 receives the output of the adder 172 the base 
164 from the endpoint table 160 and the offset from the 
received ATM cell. The output of the multiplexer 176 is 
applied to an input of an arithmetic logic unit (ALU) 178. 
The ALU also receives as another input a value read from the 
address registers 170. The outputs of the ALU 178 are a 
condition code which connects to the receive controller (228 
in Fig. 8) and a result which is connected to a demultiplexer 
179 and the address registers 170. The demultiplexer 179 
also receives as another input the output of adder 172. The 
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output of the demultiplexer 179 is applied to another input 
of the comparator 174. The output of the comparator 174 is 
either an error or an address within the endpoint range which 
is then input to the TLB 180. •• 

The receive controller 228 ' is used to control the 
multiplexer 176. ALU 178. demultiplexer 179. and read and 
write of the address register 170- in accordance with the 
state information 150 from the connection table 140. the 
opcode/control information from the received ATM cell (120 
and 122. respectively. . in Fig. 7). and the condition code 
from the ALU 178. There is great flexibility within the 
receive controller 228 with respect to features supported and 
the implementation of these features. If only the basic five 
addressing modes described above are implemented, and because 
these modes are very simple, the system implementing them can 
be hardwired via a finite state machine. Fig. 10 shows a 
sketch of such a table-driven implementation. The sender 
controller 234 could be realized in a manner analogous to the 
receive controller' 228. 

m Fig. 10. the condition code, state and opcode are 
respectively inputs 184, 186 and 188 which are used to index 
a table 190. The outputs to the table are control signals 
192 to the multiplexer, demultiplexer. ALU and latches (not 
shown), new state information 194 and a mask 196. The use of 
these outputs will be described in more detail below. In 
Figure 9. these outputs are all subsumed by the line labelled 
control from the receive controller to the operation logic. 

Although not shown in Fig. 9. there are data paths to the 
connection table 140. endpoint table 160. address registers 
no. and TLB 180 so that the host processor can read and 
..-ite their contents. This functionality is provided so the 
operating system can maintain the tables and TLB and so 
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„..ications ca„ access an. »a„ipuUce the a...ess 
/e^ss.ers. applicaUon access .he address 

.e,is«rs poses a protec-acn people, though. T-e easiest 
..l„.ion U to aeny direct access and force applications to 
the address registers only via operating syste. 
„ns. However, this solution .akes address register access 
,^„..e. «.other solution is to .ap the address registers 
into the application virtual address (VM space. That .s. 
,,,,ess registers of each connection are .apped to a 
different physical address range. Then see additional 
could extract the connection nun^er £ro. th.s 
p.,.ical address and use the address register base 148 and 
.ounds in the connection table 140 to access the 

.PPropriate region of the address registers 170. with this 

T reouired only to 

solution an operating system call as requa 

estahlish^he mapping. Thereafter an application can access 
,adress-register-s using the same address register base and 
bounds circuitry used by incoming messages. 

Although the endpoint base has been described as a VA. it 
.ould also be a physical address (PA), m fact, if the base 
a PA. the mapping can be simplified to merely a bounds 
cheC. however incurring tvo constraints. The first 
.onstraint is that an endpoint larger than a page >s 
.,,„cated on wired down consecutive physical pages. The 
second constraint is that the address registers can only 
contain relative addresses or PAS. otherwise VA to PA mapping 

■ =.1 The first constraint restrains use of 
is still required. The nc!"- 

.esi.ea. THe second constraint .eans t.at ea.het an 

nncr use full address pointers for indirection 
application cannot use rua 

• a svstem is invoked explicitly to map a 
or the operating system ^..^ess 
...e- -0 a PA before storing the pointer an an address 
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' The functioning of operation logic 230 wiU nov be 

oescr^bed. A connection nun^er (conn«) is derived from the 
„ess=ge vhich is used to index into the connection t.hle 
,,0. protection is provided hy insuring that the connection 

' a'- ^ a valid connection 

„uri>er is vithin the table size, i.e. .s a 

nuinber. for example by using the comparator 152. 
- The ihd-ex value fro. the AT« cell is added to the base 
fron,';h-e connection table 140 using adder 15.. This su. 
is then compared with the bounds 1.5 by comparator 156.^ H 
,ne sum is vithin the bounds, it is provided as an address to 
.be address registers 170. Otherwise, an error trap occurs. 
, value from the address register is thus obtained and may be 

applied to the ALU 178. 

The endpoint numb^ obtained from thejonne«ion table 

„0 is used to retrieve the "base l^I which is' a-dded to the 

„.tset value from the received MM cell using adder 172. 

.Ks su., is applied to the multiplexer 176 and demultiplexer 
The multiplexer is controlled so as to select one of 

.ne base le,, the sum from the adder 172 and the offset to be 
apolied to the ALU 17S. The ALU is controlled to perform 
operations on the value received from the address registers 
no and the output of the multiplexer to provide a ccnd.tron 
code and an address. The operations which can be performed 
bv the ALU 178 include, but are not limited to. operations on 
.adress register contents, such as adding for post-increment 
„„ae, logical operations and comparison, e.g., for 
oonditional interrupts. The ALU can be used for various 
..ner operations on the address register contents. Unle 
oonditional interrupts and operations on the add.e 
registers are to be restricted to addressing modes that 
.ot use the address registers, the address register memory 
Should be multi-ported or clocked faster than the rest o- 
operation logic 230. 
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The result provided by the ALU 178 may be applied to the 
address registers 170. The demultiplexer 179 is controlled 
so oS to select one of the sum from the adder 172 and xhe 
result from the ALU 178.. The result is compared with the 
bounds 166 for the endpoint using comparator 174 to add. to 
the protection described above. 
- "The final line of protection is' finding a valid mapping 
entry in ^he TLB 180 for the address output from the 
comparator 174. A TLB miss, either because the desired 
address mapping is not present in the TLB or because of 
inadequate access rights, causes an exception to be delivered 
to the operating system. While handling a TLB miss via an 
exception to the host processor uses little hardware, it has 
the disadvantage of blockijig further processing of incoming 
cells while the miss is serviced. These incoming cells 
preferably are throttled and_ buffered but may also be 
discarded. Alternatively the interface may either use 
hardware to service TLB misses or have dedicated mapping per 
endpoint to reduce or eliminate misses. The former choice is 
hardware intensive, inflexible, and still will delay cell 
processing because of delays in accessing the host memory for 
Snapping information. The latter choice might be workable if 
the endpoint table is expanded to contain one or two mapping 
entries per endpoint. However, such expansion leads to a 
space and performance tradeoff. 

The receive controller 228 decodes the opcode field 129 
and uses the condition code from the ALU 178 and the state 
information 150 from the connection table 140 to determine 
what to do with the incoming ATM cell. The control signals 
output by the receive controller 228 are determined according 
to the opcode and condition code and are used to effect a 
desired addressing mode and to store the data. The mask 



BNSOCXID; <GB „2301264A_(_> 



o..-put 19. (see Fig. lO) selects which elements, e.g. using a 
,_.,.e granularity, of the data are actually stored in the 
receiver .e.ory. For a 32 byte data blocK. the .as. can be 
taken directly from 4 bits of the opcode field. 

^tate 150 records connection addressing 
The connection state 

i„.o™aticr. acoss cells in a .uliipl. cell .essa,e. TKe 
.irst cell in a -uUiple ceil Sessa,e is a- ••conrrcl" cell 
^hat chooses the addressin, ~<ie and .specifies the offset and 
i„aex This infomation is stored in the state field of the 
connection tatle UO so that suhseguent cells for the sa.e 
connection can o.it the control information and thus carry 
„ore data. For example, the first cell could have 32 bytes 

aata and subse^ent cells could have .8 bytes of data 
,perh.o-s ix.-«L-5 .or-.at)... rach subsequent cell uses the 
„.,.ol infornation stored in the connection table state 
.ie>a The end of such a data cell sequence could be 
indicated either by storing a cell count in the state field, 

by using the standard format -herein the last cell in 

3..ch a secuence carries the length and a »C. Every cell is 
checked until a correct CRC is found. Assuming a multi-cell 
message terminates with a CRC, this scheme could transfer « 
,ata blocks Of 32 bytes each in U/3(«-a)l*2 cells for N . 1. 
„hich asymptotically achieves 3/2 the bandwidth compared to 
sending one 32 byte data block per cell. 

„,„v other functions as described above can be 
implemented by adding to the opcodes interpreted by the 

receive controller 228. 

;^y error trap to the operating system may be handled by 
....er discarding the cell or by inserting the cell in an 
eiror queue and signaling an exception to the operating 

svste.T.. 
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Having now described the receive portion of the fronr 
end. the send portion will now be described. Borh control 
information and data are provided :to send a cell. The 
control inforrnation comes from a set of send registers 236 
which each connection has for sending cells. The first three 
registers in this set store the control information, i.e.. 
the opcode, operand, and index, for "the cell. The data comes 
either from the endpoin^^alsociated!-' with the" connection or 
from a special block of . send registers, which is useful^ for 
formulating read requests. 

The set of send registers 236 for each connection also 
contains a status register, a mode register and a "go" 
register. The status register contains the number of cells 
sent. If initialized to 1. it will be set to 0 when the 
IntiYIi-ce actually sends the cell. This feature is useful to 
serve as an acknowledgment that a sent cell has actually left 
the interface since a cell might not be sent right away due 
to flow control or an exception. The mode register enables 
two variants of sending which will be described below. The 
go register is used to actually cause a cell to be sent as 
will be described below. » 

As discussed previously, numerous implementation options 
are possible for the front end. Several different 
embodiments are presented here. A first embodiment of the 
front end architecture in Fig. 9 uses minimal hardware. 
Address register operations are restricted to at most one 
address register read and write operation per cell. Endpoint 
pages are pinned in physical memory while the buffer is 
active. Remote reads are handled by the host processor, via 
an interrupt. The first restriction means that any 
com.b:nation of the primitive operations listed above acting 
on address registers acts on the same address register. 
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M.a.y opera-acns ot»in one ».,u.e« f.o. tKe 

message op „„,,ied by the message, vhich does 

,ea.essin, operations .s "J 

co.p>ete>y isoUte the .ece.vet state .. ^^^^ 
.Uernatively. the desited functionaUty can be « P_ 

, messaoes e.g.. the postincrement amount could 
, seguence of messages. g 

specified via a register add a 

,nd even restricts the nu.rt.er of small endp 
restrictions lead to a simple architecture. 

,ne receive side oi this eo^odiment is as described 
.hove, except that, because the endpoint pages are pinned in 
,,,, architecture, the endpoint table leo contains the 
.nysical address of the base of an endpoint. Ho-ever. since 

- - ~ " allocated contiguously, a 

phvsical pages are not necessarily allocate 

L -cache- Of address translation pairs is used to map .he 
.urn of an endPoint base address and offset, labeled "V." in 
„,„re , to the appropriate physical address. For endpo.nts 
Of one page or less is site, the P. may be stored directly in 
..e base field IM of the endpoint table entry 1«. A direct 
capping bit is added to all endpoint table entries le. to 
JL. interpretation of the base field, additional mapping 
„„,es could be added to the endpoint table UO to 
accotiCTodate larger endpoints. 

..e opera.icn of the send side is as follows. T..s., .he 
send endpoints are restricted to be integral page sizes. To 
lori.e sending .o a particular connection without a Kerne 
each outgoing connection .ro. an endpoint has a un.^ 

nd area at a fixed offset (in the hxgh 
virtual mapped command area 

o.der Virtual address .ics> fro. the endpoint virtual pag 

• . the endpoint. These connection cor«nand 
and the same size as the enap 

H.rf to the network interface. The seno 
paaes map uncached to 
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registers 236 for the connection are mapped into the page 
j..st below the base of the connection command region. 

TO send a block of data at an offset from the base of the 
source endpoint to the network via. connection C. a write 
operation is performed to the location at the same offset 
..cm connection C's command base. ■ The value written as 
ignored. After mapping, the low order physical address b.ts 
are the physical address (or an offset, as described later) of 
data block in the endpoint to send and the high order 
physical address bits contain the connection nun^er . The 
send controller extracts the control information from the 
send registers 236 of the connection and reads the data block 
to compose a cell- Alternatively, the data can come from a 
sr,ecial block of send registers 236 as discussed above. 
* wwsome host- -p-rocessors -there-may -not- -be sufficient 
Mts to encode both the physical address of the data block 
ard the connection number, as well as the send controller 234 
address. into a physical address. Two alternatives are 
possible in this such systems. The first alternative is to 
replace the physical address of the data block with the 
o'fset from the endpoint base. This requires the base 
address of an endpoint to be accessible to the send 
controller 234 and endpoint physical pages preferably to be 

Stored contiguously. 

The second aUe^tive is to use a right-shifted version 
the physical address of the data block. * very convenient 

addressed at 32 byte granularity, the least significant fve 
hits Of the physical address are unused anyvay. assuming a 
hyte-addressed machine. This shifting frees five bits in the 
Physical address for encoding the connection nu^er and send 
controller 23, address. Hc-ever. this right shifting has 
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repercussions. Since the p.ge offsec is nor Cnsea by 
„i„ual .e.ory capping, the endpoint co^^and virtual address 
,nd the data bloc, physical address are linked by the 
follo-in, constraint, the five .ost significant bits in the 
endpoint cc™.and virtual page offset correspond to the five 
least significant bits of the data bloc, physical page 
„u^er.- This constraint has three consequences. First, the 
endpoint co™.and-region-is-a -factor'. of 3J smaller than the 
endpoint: each entry in the con^and region now ..ps to the 
base of a data blocR. Conse,^ently. each block of contiguous 
3, pages in the endpoint has the same memory protection, 
second, endpoints are multiples of 32 pages in site. This 
constraint can be relaxed if the send controller 234 can 
.ccess the endpoint base and si«. Third, endpoints are 
mapped to contiguous chunks of 32 pages .ligned-...ith a 32 

page memory boundary- 

me mode register enables tvo variants of the send 
procedure. In the first, the operand is taken from the value 
ac-ually written to the connection command region. The 
second causes an exception if an attempt is made to send when 
me status field is non-zero. The "go" register is not used 

in this embodiment. 

«>y state operated on by the cell, e.g. write to address 
registers, is not updated until after the point of the last 
possible exception point, a TLB miss, for that cell. The 
„n is retained in the input FIFO and processing of further 
cells from that connection is blocked until re-enabled by the 

.»n, are only removed from the input FIFO 
host processor: cells are oniy 

v..-,en a cell -cc™.its- after the last exception point. 

,n the second embodiment, the three restrictions of the 
,irst embodiment are removed to obtain three ma.or 
enhancements: no pinning of endpoint pages, remote reacs 
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without interrupting the host processor, and a richer set or 
register and conditional operations. To support i;his first 
enhancement. the endpoint table :160 contains virtual 
addresses and the TLB maps from virtual addresses to physical 
addresses. This enhancement also introduces a new category 
of exceptions: page faults from references to paged out 
endpoint pages. These are treated as another class of 
exceptions and are . serviced by the . host processor, which 
maintains the main virtual mapping tables. The operating 
system is now responsible for keeping the mapping information 
consistent with the host memory state. 

To handle remote reads without the interrupting the host 
processor. the address mapping for the read data is 
computed. To do so, the receiver side, which does mapping, 
is reused -for -sending— i.e.- -the_:operation- logic 230 is 
multiplexed between the receiver side and the sender side. 
In this exJDodiment, in fact . the -operation logic 230 is used 
for all sends and not just sends requested by remote reads. 
Consequently it is no longer necessary to use the virtually 
mapped coirjnand pages for protection. However, in this 
etrJjodiment, the virtually mapped send registers are 
retained. To send a block of data at an offset from the base 
of source endpoint. the offset is written into the "go- 
register for the appropriate connection attached to that 
endpoint. This write causes a data block at the offset. 32 
byte block aligned, from the endpoint base to be read and 
composed into a cell with the control information in the 
opcode, operand, and index registers and sent via the 
associated connection. A multiple send mode is also added in 
which the status register is set to the number of cells to 
send starting at the specified offset. 
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in this embodiment, three primitive operations caa bS 
executed per message: address generation. register 
operations, and conditionals. A main opcode controls the 
selection and ordering of the primitive operations. Example 
opcodes are .read, read multiple, write, write multiple, and 
software exception which causes an interrupt to the host 
processor. The_ instruction format allows up to three 
different -register_.operands. to., be named in addition to an 
immediate operand. To accommodate all the accesses to^ the 
address register 170. the register operations are all triple 
clocked. Due to potential side effects, state recovery after 
exceptions as more complicated. 

It is sometimes preferable to have greater flexibility in 
,ntrcl functionality and cell interpretation than may be 
rovided by the f ir-st - two embodim_ents. For example, 
operations for locking or swap-and-compare may be desired. 
Also, it might be useful to customize the cell- level protocol 
for certain applications. Ultimately, flexibility could be 
ailable in the form of full programmability . which is 
iways a tradeoff between complexity and cost. Two ways to 
dd flexibility to the previous embodiments are to make the 
,ecewe and send controllers 228 and 234 writable, and to add 
a progran..able finite state machine for cell interpretation. 
The first non-header word of an ATM cell may index into a 
v-itable control memory that interprets the remaining fields 
o. the cell and sets all the control signals. This control 
.emory might have a number of con^on cell interpretations 
^..ew^.ed in and a number of programmable ones, perhaps even 
connection-dependent interpretations. However. 

..C'O-processor should also be considered for this purpose. 

,,,3 a third embodiment uses a conventional 
^..-oprocessor. in effect a communication co-processor, for 
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,he ertire front end. The mcroprocessci simplifies tne 
Hardware since of the internal control logic of the 

microprocessor is leveraged and its ovn TLB can be used as 
-he TLB ISO. The endpoint number may be incorporated .n the 
high address bits of the v.. Alsoi fully programmable cell 
interpretation and control may be obfained. 

- Although the third embodiment provides flexibility, even 
-a^-T55Mbps, -it may be difficult to use a microprocessor for 
everything. U ma.es more sense to use a microprocessor to 
interpret high level operations. For example. a 
Microprocessor could be particularly effective for relieving 
the host processor of responsibility for read operations, 
such a co,.-,unications co-processor could still benefit from 
hardwired address generation and -^^"^ " ''"'^ 

reduce the computltion-'burden". ~ 

fourth embodiment uses a pro9r.m,mable cell interpreter 
and so is a hybrid of the harTwire cohtrol of the first t«o 
embodiments and the microprocessor control of the third 
e.iodinent . Any complex decoding or operations are handled 
either a host processor or a co-processor. In this 
enbodiment, the various functional units retired are 
P-eferably implemented on a programmable gate array, using 
for the various tables. This basic functionality may 
even function as a front-end for a co-processor so that the 
co-processor would have a reduced load. This can be very 
nelpful to realize both high performance and flexibility at 
hioh date rates like 622Kbps or l.SGbps. 

•hi= these embodiments could easily be 

Where possible, tnese 

pipelined for high performance. However, at lS.Mbps data 
„'tes bytes of data arrive at about .onsec intervals, wh.ch 
should be long enough to complete the address and control 
setup. It may be necessary to buffer the data for about a 
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were or so to satisfy nemory hierarchy access times. At 
622Mb:.s only a few pipeline stages should be necessary. 

AS described above, any of these embodiments for the 
.ront end 210 is also connected to a "back end" 214 of the 
network interface for connecting the front end to the memory 
of its host. Three embodiments for "such a back end will now 

I 

be described. 

-in a f irst- embodiment, the front .end may connect directly 



to the main memory bus . A cache controller snoops this^ bus 

to ensure coherency. 

Alternatively. while a direct memory connection is 
attractive performance-wise, it is only an option to computer 
builders. An I/O bus interface such as the PCI bus would be 
accessible to a far greater market. The disadvantage of most 
I/O buses, including the PCI bus:^^s-delay in gaining-control 
of the bus due to the activity of other bus devices, thus 
some decree of on-board buf f er ing - is used -which adds to 
1 agency • 

^ AS another alternative, a direct cache interface could be 
,sed which does not require processor modifications. In this 
embodiment the network connects directly to external 
processor cache. This direct coupling of the network to a 
cache may reduce the copying of message data. To avoid 
unduly diluting this cache with network traffic, and thus 
neoatively impacting the performance of the processor to 
Which it connects, another embodiment couples the network to 
a separate message cache 252 as shown in Fig. 11. 

in Fig. 11. a message cache 252 is connected to the 
..cro-processor 250 via a bus 256. The message cache 252 is 
3.S0 connected to data cache 25. and main (or host) memory 
(not Show.) via a bus 258. A mapping unit 260 connects to 
.he message cache via bus 262 and to the network. 
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This message cache 252 is fully integrated into the 
mer7.ory hierarchy (as shown by connections 256 and 258). so 
there is no need to copy data fron, a message buffer to the 
meT.ory hierarchy before a process can access the data. The 
interface may be implemented at the secondary cache level, 
and thus no expensive, special purpose, or custom processor 
modifications are required. By restricting the data size of 
messages to be equal to the cache block size. e.g.. 32 bytes, 
cache blocks can be updated atomically. eliminating 
complicated and slow circuitry for updating partial c'ache 
blocks . 

A problem in this direct , cache interface in Fig. U is 
maintaining coherency between the message cache 252 and cache 
254 using a low overhead mechanism. • To solve this problem, 
the magnitude of the_ ,_in"herency ..problem -is reduced by 
allowing only the network to write into the message cache 
252. Then each write into the message cache 252 checks for 
and invalidates any blocks in the data cache 254 with a 
matching tag. The impact of this checking on the data cache 
254 is minimized by performing the checks on a shadow copy of 
the deta cache tags. Further details on such a direct cache 
interface may be found in a copending U.S. patent application 
entitled "Low Latency Network Interface", filed November 16, 
1993 by Randy B. Osborne. 

Numerous modifications and variations to the embodiments 
of the network interface can be made. For example, exception 
handling and flow control can be integrated in the front-end. 
for t.he following reasons. Exceptions caused by error traps, 
protection violations, unimplemented operations. TLB misses, 
and page faults, e.g.. host memory page faults, and possibly 
co.nnection and endpoint table page faults, can slow cell 
.recessing in this system and thus lead to a flow control 
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problem. cell processing can also be blocked to ensure 
atomicity during host processor accesses to this system 
..emory structures. For error traps, this problem can be 
avoided entirely by immediately discarding the offending 
cells or pushing it into a buffer overflow problem by putting 
such cells on an exception queue • for examination at the 
convenience of the operating system. Of course, discarding 
the of fending cells will not work 'for the other exception 
types because even if the offending cell is discarded and an 
implicit or explicit "retry" message is returned ■ to ' the 
sender, the exception condition still has to be repaired 
before forward progress can be made. In the meantime, 
incoming cells are discarded, buffered, or throttled. 

However, this only applies to incoming cells belonging to 
the- sane connection- affected- by the . exception condition, 
cells belonging to other connections can be processed once 
the exception condition is -saved. Cel-l-s- belonging to the 
exceptio.. incurred connection cannot be processed, even if 
they are not affected by the exception, i.e.. they do not 
cause a TLB miss or page fault, since some applications may 
depend on the guarantee of sender ordering of ATM cells. 
Sance this ordering guarantee is per channel, which maps to 
connection in this system, it is acceptable to continue 
processing cells which belong to different connections but 
share the same endpoint as an exception incurred cell. Of 
course, these cells may cause an exception. 

A strategy for dealing with exceptions is provided will 
now be discussed. First, the offending cell is removed fro. 
,.e front end as quickly as possible. Cells causing error 
t-aps can be discarded or queued. Other cells are retained 
.n the input FIFO. The connection is then marked as 
e.ceotioned. Flow control is invoked next to throttle 



senders transmitting further cells for that connection. In 
the meantime, any further cells which arrive for that 
connection are buffered. whereas other connections may 
continue processing cells. Only the first step is necessary 
for error traps. Global "exceptions", like blocking the 
system during host processor accesses, require throttling and 
buffering across all connections.' The throttling and 
buffering, are compatible with credit-based flow control 

schemes . ^ 

Hybrid address mapping could be used as an alternative to 
a large global TLB for mapping. That is. each endpoint 
buffer table entry could contain one or two mappings and the 
TLB could contain the rest. This hybrid mapping is a 
generalization of the idea, discussed above in connection 
with the first front end embodiment, of inserting the- mapping 
for single page endpoints directly in the endpoint table. 
The mappings in the endpoint table entries could be managed 
by the application so as to always contain mappings likely to 
be used soon. 

Locality in the connection and endpoint tables can also 
be managed. That is. with many connections and large 
endpoints, the respective tables might be large in size. One 
idea to reduce the table sizes required is to exploit 
locality by essentially "paging" out entries from the 
tables. On a page fault when accessing these tables, the 
interface can remove the page faulting cell from the cell 
processing pipeline and buffer and flow control further cells 
for the page faulted connection as described above. The host 
processor can then perform the actual "paging" of the tables, 
restore state, a.nd restart cell processing for the faulted 
connect ion . 
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structures can also be added to provide fair network 
access and to prevent network deadlock. That is. processes 
are prevented from interfering with each other either by 
blocking the network or by congesting the network and 
inducing deadlock in other processes. To ensure fairness 
some form of admission control could' be provided that limits 
the duration that one connection can send to the network if 
other connections have pending, non-flow controlled traffic. 
For performance reasons it as also a good idea to give 
priority to operating system traffic. Preventing deadlock 
requires several steps. First, each connection has 
independent flow control. Independent flow control per 
VCI/VPI if fairly standard in ATM networks and interfaces, 
second, any global exceptions that block processing of all 
cells have bounded duration. Third. cells requiring a 
response from the network are removed even though the reply 
connection may be flow controlled. This removal is made 
possible by reserving some buffer capacity for reply traffic, 
such as by allocating a separate VCI/VPI with associated flow 
control buffers, per connection just for reply traffic. 
Admission control should favor reply traffic over new traffic. 

To ensure at least the operating system can always make 
progress, it should also have its own connection. Any pages 
that the operating system might use. such as in the 
connection and endpoint tables and address registers, should 
be locked down to prevent unbounded deliy due to page 
thrashing. 

Global address registers may also be used in the front 
end. in the embodiments described above, address registers 
are currently private to each connection. This could make it 
inconvenient for several different connections to the same 
endooint to share a common queue. One way to solve this 
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« .napoin. »..e. ..an .ei„, s..icU. .oca. . 
;.„.ec.io„s. THe enapcin. taMe ccu.a contain .ase an 
.ounas .0 sue. ,.o.a. roisters in:..e sa.e ^ry as 

i 

local registers. 

P.agn,entation buffers can also Se provided in the front 
end That is. for .ulti-ceU .essa.es. .atc.in. the <B byte 
p3.Toad~^ W - .o^P^^^^ Of .•..e.orv and cache b oc 
Les leads to fragmentation problems. For example, w.th 32 
,y,e blocKs. there will be IS byte fragments. Both the 
sender and the receiver may .eep fragmentation buffers to 
...^ent and then reassemble 32 byte blocKs. or whatever the 

HO Mock Size is. However, it highly lakely 
memory and cache block size 

^KiP to leverage whatever segmentation 
that it will be possible to lever ag 

.-; T ~T^-aireadv is""in a standard high 

and reasseirJDly support there is already 

.andwidth interface for this fra^entation purpose. 

The sender and'receiver addresses" also may not be aligned 

r rn a menoty or cache block. Assuming that 
with respect to a memory 

,„U,aU,n.en. is t.e sa.e a. the senae. ana receive., an 
.nsnea .ice. can .e sen. ana Che portions chac shouia n=r be 

oci-.i.a.ion c. ceUs. tur sucn ™isali,^enc is U.e., CO 
„re. Anoche. possible performance penalty is chac some 
..Ciceccures .ay restrict subWoc. aaaressin,, so it ™.y be 

, • « or «:catter, operation 
, mack The masking, or scattet . 
selectively under a mask. me 

aone at tbe receiver. T.e senaer »ay also perform 
„.s..n,. but .ithoot a .atber capability it is optionai. . 

,e useful for security to prevent certain f.eias ..o. 
cut on tne network, but it aoes not improve efficiency. 
' n some cases tbe sena.n, aaaress may be unali.ea bot. 
...P.ct to bloc, bounaaries ana .tb resPect to tbe receiver 
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SHUtin,. in 3a.iUon « .».in,. is require. „ 
,e.; -it^ .his p.0Me.. TO sow. .his p.oW.». .U,n.e« 

sh-;fters can be added. 

. p.efe.ch ^eue couia aUo he, used in the front e„a 
.ec.use prefetching is useful for ^hidin, latency, .he syste. 
sL J .o.e .tu.. auea. sOppo.ts 

p.ovi.e co.pU» prefetching . a processor that en 
-„Jate^r-efe.tches, such as the ..C...«pha, .a. he use. a 
so.e hardware that can decode prefetch operations fro. th s 
„ and turn the. into rea: fetches. "- - ^ ^ 
L Possihle to put such hardware on a UO hus-.ased 
,„erface; hut ir could certainly he done, as it -as done .n 

ri-ta bus. Provided that a fetch can 
connects directly to the dota bus. , , . 

, the return address of the fetch is 

be initiated in- some way,-tne t« ^. . , 

3,„,,y Usred as the tail of a ^eue. The conditional 
::.:lpr scne^ecouU e.enheused to indicate when the 
caieue is nonempty. 

■ .,so, in so»e applicarions. it .ay desir.hle to provide 
nearer isolation he.ween the sender and the receiver . 
' ,„ .uecr deposit .odel presented ahove, the sender 
,,,ces an insrrucrion and operands directly into a message 
^ reference some state (in the 

operands may be inwediate, or reference 

, ■ th. receiver. The receiver executes 
add-ess registers) in the receiver 

; instruction. « described earlier, access restrictions 
the address registers provide see degree of protection 



i-hAre still may be inadequate 
^ A««i=-ion However, there stiA* 
and xsola.ion. ^^^^ ^ ^^^^^^ 

isolation for some applications due to 

™. all the instruction operands, even xf at 
still has to name all 



^^^^^ .^,,3 p,,,,,. 

:::;:,r;:.-.. 

' ^ . in receiver-based addressing syster.s . 

reg-airing isolation, as m recea 
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Another way to solve this problem is to allow the direct 
instruction (operation) in a message to be replaced by a 

^T, rhP receiver (an instruction 
pointer to an instruction in the receiver 

pointer). The instruction may directly reference receiver 
operands, e.g. in the address registers, without Knowledge of 
sender. Messages can still provide immediate operands 
.nd name receiver, operands though -the receiver may choose 
„ot to use these operands, -as- may. -b.e_ .necessary to maintain 

isolation. , 

one simple embodiment of this solution will now be 
p.esented. The principle modification to the receive side as 
described to this point is to add an instruction and operand 
buffer area as shown in Figure 13. The connection table 1.0 
now contains base and bounds entries 300 and 302, similar to 
that for address registers, for-access to an .instruction 
memory 304. To keep the scheme simple, each instruction is 
composed of an operation and operands in exactly the same 
format as in an ATM cell as described earlier. The operation 
controls from which location - the instruction memory 30. in 
the receiver or the message - operands are taken. Protection 
hits, similar to those for the address registers 170. allow 
the receiver to control which instructions serve as entry 

points to the sender. 

Three >re at le.st enhancements to this scheme. The 
f.rst is ,lcbal instructions. Kany connections are lively to 
Share the same operations, though on different operands. To 
.ccc,™oaate this expectation, a capability is added for 

.i«r,c ' These are instructions that are 
global instructions. inese 

globally accessible across all connections. They are 
constrained to operate only on sender-supplied operands to 
di.'icultv of operating on different receiver 

oDerands. 
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The second enhancement is separate instruction and 
operand memory. Instruction and operand memory could be 
separated at the cost of complexity to save memory storage. 

The third enhancement is providing multiple instructions 
per cell. A sequencer can be added to step the receiver 
through several instructions per cell. The first instruction 
in such a sequence serves as the entry point. 

It is easy to further add conditional sequencing 
operations -subroutines, etc. However, to keep the interface 
simple, more complex functionality is best obtained' by 
trapping to the host processor or by adding a co-processor, 
such as a micro-processor. 

Having now described a few embodiments of the invention, 
and some modifications and variations thereto it should be 
apparent to those skilled in the art the foregoing is 

merely illustrative"' and" not"limi~ting . having been presented 
by way . of example only. Numerous modifications and other 
erriodiments are within the scope of one of ordinary skill in 
the art and are contemplated as falling within the scope of 
the ir.vention as limited only by the appended claims and 
equivale.nts thereto. 
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CLAIMS 



1 A low overhead communication system for a computer system 
having a sender and a receiver interconnected by a network, 
wherein the receiver .has a network ^interface connected to a 
processor controlled by an operating system. the 
communication system comprising: 

means .-at-t-he-sender, for sending to the receiver a 
message which includes an operand, an indication of an action 
to be performed and a reference to information stored at^ the 
receiver; 

means, in the network interface, for receiving the 

message; 

n^eans. in the network interface and operative in response 
to receipt of the message, for determining whether the action 
is permitted to be performed at the receiver: and 

„,eans. in the network interface and operative when the 
action is permitted. for performing. separate from the 
processor and operating system- in the receiver, an operation 
on the operand in the message and the information stored at 
the receiver and for performing the action according to the 
operation . 

2. The communication system of claim 1, wherein the operand 
indicates an address in a memory in the receiver and wherein 
the means for performing an action includes n»eans for 
depositing the message at a location in the memory in the 
receiver according to the address in the message and the 
information stored at the receiver. 

3. The communication system of claim 2. wherein the operand 
in the message indicates an address register in the network 



interface storing an address in the receiver and the 
information stored at the receiver is the address stored in 
the address register, the conimunication system further 

comprising: 

means for obtaining the address from the indicated 

address register; and ' 

means for storing-the message- in the memory at the 

obtained address. 

4. The communication system of claim 3. wherein the operand 
further indicates and offset and the system further comprises 
the means for updating the address in the address register by 
the offset. 

5. The communication system of claim 1. wherein the means 
for performing an action comprises: 

means for comparing the operand to the information stored 

at the receiver; and 

means for generating an interrupt to the processor when 
the step of comparing indicates that the message requires 
immediate action. 

6. The communication system of claim 5V further'-'coiiiprising 
means for updating the information stored at the receiver. 

7. The communication system of claim 5, further comprising 
means for queueing the message when the means for comparing 
indicates that the message does not require immediate action. 

8. The communication system of claim 5. wherein the operand 
in the message indicates an offset and an address register in 
the network interface storing an address in the receiver and 
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the information stored at the receiver as tne addresB stored 

in the address register, the communication system further 
comprising: 

means for obtaining the address from the indicated 

i 

address register; and 

means for storing the message in' the memory at the offset 

from the obtained address. 



9. A low overhead communication system, comprising: 

a sender having a processor which requests that messages 
be sent and a network interface connected to the processor 
which forms a message, in response to a request from the 
processor, containing an operand, an indication of a desired 
action and a reference to information stored at the receiver; 

a recei ver having a processor _cont roll ed by an operating 
system and connected to a network interface; 

a network having connected between ' the^ network interface 
of the sender, and the network interface of the receiver and 
being adapted to communicate the message between the sender 

and the receiver; 

wherein the network interface of the receiver receives 
the message and which performs the action indicated by the 
message according to the operand and the information stored 
at the receiver only if the action is permitted to be 
performed by the receiver. 

10. The communication system of claim 9. wherein the operand 
indicates an address in a memory in the receiver and wherein 
the action is depositing the message at a location in the 
memory in the receiver according to the address in the 
message and the information stored at the receiver. 
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11. The communication system of claim 10, wherein the operand 
in the message indicates an address register storing an 
address in the receiver and the information stored at the 
receiver is the address stored in the address register, the 
communication system further, comprising in the receiver: 

means for obtaining the address from the indicated 



address register; and — ' - — 

means for storing the message .-.in the memory at the 

obtained address. ^ 

12. The communication system of claim 11, wherein the operand 
further indicates and offset and the communication system 
further comprises means for updating the address in the 
address register by the offset. 

13. The communication system of claim 9. wherein the receiver 
includes means for comparing the operand to the state 
maintained by the receiver, and means for generating an 
interrupt when the means for comparing indicates that the 
message requires immediate action. 

14. The communication system of claim 13, further comprising 
means for updating the information stored at the receiver. 

15. The communication system of claim 13. further comprising 
means for gueueing the message when the means for comparing 
indicates that the message does not require immediate action. 

16. The communication system of claim 13. wherein the operand 
in the message indicates an offset and an address register 
storing an address in the receiver and the information stored 
at the receiver is the address stored in the address 

-7C 
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register, the communication system further comprising, in the 
receiver: 

means for obtaining the address from the indicated 

address register; and 

'means for storing_the message in the , memory at the offset 

from the obtained address. 

17. A method for low overhead communication in a computer 
system having a sender and a receiver interconnected by a 
network, wherein the receiver has a network interface 
connected to a processor controlled by an operating system, 
the method comprising the steps of: 

sending a message from the sender through the network to 
the receiver, wherein the message includes an operand, an 
indication of an action to be performed and a reference to 
information stored at the receiver; 

receiving the message at the receiver; 

insuring that the action to be performed for the sender 
is permitted at the receiver; and 

if the action is permitted, performing, separate from the 
processor and operating system, the action at the receiver 
according to the operand in the message and the information 
stored at the receiver. 

18. The method of claim 17, wherein the operand indicates an 
address in a memory in the receiver and wherein the step of 
performing an action is the step of depositing the message at 
a location in the memory in the receiver according to the 
address in the message and the information stored at the 
receiver . 



19. The method of claim 18, wherein the operand in the 
message indicates an address register storing an address in 
the receiver and the information stored at the receiver is 
the address stored in the address register, the method 
TJither comprising-the steps of :- ! 

obtaining . the address from 'the indlcatee address 

register; and 

storm, the message .in the memory .at tHS" obtained address. 

20. The method of claim 19. "herein the operand further 
indicates and offset and the communication system further 
comprises the step of updating the address in the address 

register by the offset. 



21. The method of claim 17. wherein the step of performing an 
action comprises the steps of: 

comparing the operand to the information stored at the 

receiver; and 

generating an interrupt when the step of comparing 
indicates that the message requires immediate action. 

22. The method of claim 21. further comprising the step of 
updating the information stored at the receiver. 

23. The method of claim 21. further comprising the step of 
^ueueing the message when the step of comparing indicates 
that the message does not require immediate action. 

24 The method of claim 21. wherein the operand in the 
message indicates an offset and an address register storing 
an address in the receiver and the information stored at the 
receiver is the address stored in the address register, the 
com..unication systeni further comprising the steps of. 

"72. 
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obtaining the address from the indicated address 

register; and 

storing the message in the memory . at the offset from the 

obtained address. 

25. A cor«nunication system for low overhead communication in 
a network of multi-user compute r s, includi ng a sender and a 
receiver r"whereih the receiver includer^ Vo«^«sor and a 
r,emory and wherein a portion of the memory is assigned to an 
application program being executed by the processor/ the 
communication system comprising: 

means, in the sender, for sending across the network to 
the receiver a message which includes an operand, a reference 
to information stored at the r eceiver, a nd data for delivery 
to the application program; 

means, in the receiver and operative when a message is 
received, for directly depositing the data in the message at 
a location in the portion of memory assigned to the 
application program, wherein the location may be independent 
of locations of previous message data is determined according 
to the state of the sender and the information stored at the 
receiver ; and 

means, in the receiver and operative - when a ' message is 
received, for conditionally generating an interrupt to the 
processor according to the state of the sender and the 
information stored at the receiver. 

26. The communication system of claim 25. further comprising: 

means, in the receiver, for preventing ' access to the 
information stored in the receiver when the sender is not 
authorized by the receiver to access the information. 



15 
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27. The communication system of claim 25. further comprising: 

means, in the receiver and operative when a message is 
received, for changing the information stored in the receiver 
according to the state of the sender. 

2B. A network interface for a communication system for low 
overhead communication in a network 'of multi-user -computers , 
including a sender and a receiver, wherein the receiver 
includes a processor aind a memory and wherein a portion of 

the memory is assigned to an application program being 
executed by the processor, wherein the sender sends across 
the network to the receiver a message which includes an 
operand indicative of a state of the sender, a reference to 
information stored at the receiver, and data for delivery to 
the application program, the network interface comprising; 

means, operative when a message is received, for directly 
depositing the data in the message at a location in the 
portion of memory assigned to the application program, 
wherein the location may be independent of locations of 
previous message data and is determined according to the 
state of the sender and the information stored at the 
receiver and can be independent of locations used for storing 
previous messages; and 

means, operative when a message is received, for 
conditionally generating an interrupt to the processor 
according to the state of the sender and the information 
stored at the receiver. 

29. The network interface of claim 28, further comprising: 

means for preventing access to the information stored in 
the receiver when the sender is not authorized by the 
receiver to access the information. 

Is- 



30. The network interface of claim 28, further comprising: 
means, operative when a message is received, for 

changing the information stored in the receiver according to 
the stat e o f the sender, 

31. A low overhead communication system constructed and 
arranged to operate substantially as hereinbefore described 
with reference to and as illustrated in Figures 5 to 13 of 
the accompanying drawings. 

10 

32. A network interfac e for a low overhead communication 
system constructed and arranged to operate substantially as 
hereinbefore described with reference to and as illustrated 
in Figures 5 to 13 of the accompanying drawings. 

15 

33. A method for low overhead communication substantially 
as hereinbefore described with reference to and as 
illustrated in Figures 5 to 13 of the accompanying drawings. 
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